How to crawl a website sitemap and scrape all pages with Scrapy and Python

Feb 27, 2024

Web scraping is a powerful technique for extracting data from websites. Often you need to crawl an entire website to collect all the desired data. One efficient approach is to use the website's sitemap to discover all the pages you want to scrape. In this article, we'll cover how to crawl a website sitemap and scrape all the pages using the Scrapy framework in Python.

What is a Sitemap?

A sitemap is a file where a website can list all the pages that are available for crawling. It's usually found at the URL <website_url>/sitemap.xml. Sitemaps are extremely useful for web crawlers because they:

  • Provide a list of pages that are allowed to be crawled

  • Specify metadata about each page, such as last modified date, change frequency, and priority

  • Help ensure the crawler finds all important pages

Most major websites provide a sitemap, so it's a great starting point for crawling.
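
A quick way to see what a sitemap contains before writing any spider code is to fetch and parse it directly. Here's a minimal sketch using requests and the standard-library XML parser; the URL is a placeholder, and it assumes a regular sitemap that lists each page in <url><loc> entries under the standard http://www.sitemaps.org/schemas/sitemap/0.9 namespace:

import requests
import xml.etree.ElementTree as ET

# placeholder sitemap URL for illustration
SITEMAP_URL = 'https://example.com/sitemap.xml'
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

response = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(response.content)

# print the <loc> value of every page listed in the sitemap
for loc in root.findall('sm:url/sm:loc', NS):
    print(loc.text)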

Scraping with Scrapy

Scrapy is a popular open-source framework for building web spiders to crawl and extract structured data from websites. It handles many challenges of web scraping at scale:

  • Concurrent requests

  • Crawling pages by following links

  • Extracting, transforming and saving data

  • Extensible architecture for handling middleware, pipelines, etc.

While you can build web scrapers from scratch using libraries like Requests and BeautifulSoup, Scrapy provides an efficient and maintainable foundation.
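
For comparison, a minimal "from scratch" scraper for a single page might look like the sketch below (the URL and selector are hypothetical). It works, but you would still have to add your own queueing, retries, throttling, and concurrency, which is exactly what Scrapy handles for you:

import requests
from bs4 import BeautifulSoup

# hypothetical product page used for illustration
response = requests.get('https://example.com/products/1')
soup = BeautifulSoup(response.text, 'html.parser')

# extract the page title from the first <h1>
title = soup.select_one('h1').get_text(strip=True)
print(title)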

Crawling a Sitemap with Scrapy

Scrapy provides a built-in SitemapSpider class that makes it easy to crawl a sitemap. Here's a basic example:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        # parse and extract data from each page
        pass

The key parts are:

  • Inherit from SitemapSpider

  • Specify the sitemap URL(s) in sitemap_urls

  • Implement a parse method to extract data from each page

Scrapy will read the sitemap(s), follow each link, and call the parse method with the response for each page.
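
To try it out, you can save the spider as a standalone file and run it with scrapy runspider, or create a project with scrapy startproject and run the spider by its name (the file name myspider.py is just an assumption here):

# run a single-file spider
scrapy runspider myspider.py

# or, from inside a Scrapy project
scrapy crawl myspider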

Filtering Sitemap URLs

Sitemaps often contain URLs that you don't need for your scraping task. You can filter the URLs using the sitemap_rules attribute. For example:

class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [
        ('/products/', 'parse_product'),
        ('/articles/', 'parse_article'),
    ]

    def parse_product(self, response):
        # parse product pages
        pass

    def parse_article(self, response):
        # parse article pages
        pass

This tells Scrapy to only follow URLs that match /products/ or /articles/ and call the corresponding parse_* method.
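
Large sites often serve a sitemap index file that points to several sub-sitemaps. In that case you can also narrow down which sub-sitemaps get followed with the sitemap_follow attribute, which takes a list of regexes. A sketch, assuming a hypothetical products sub-sitemap:

class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = ['https://example.com/sitemap.xml']
    # only follow sub-sitemaps whose URL matches this regex
    sitemap_follow = ['/sitemap_products']
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        # parse product pages
        pass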

Extracting Data

Once Scrapy fetches each page, you need to extract the desired data using CSS or XPath selectors. For example, to extract the title and price from a product page:

def parse_product(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('.price::text').get(),
    }

Scrapy provides many convenient methods for extracting and cleaning the data. The items are yielded from the parse method, and Scrapy takes care of collecting and exporting them.
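
A few more selector patterns you'll often reach for; the selectors below are hypothetical and depend entirely on the target page's markup:

def parse_product(self, response):
    yield {
        # XPath works alongside CSS selectors
        'title': response.xpath('//h1/text()').get(default='').strip(),
        # .getall() returns every match as a list
        'images': response.css('img.product::attr(src)').getall(),
        # narrow the CSS selector to scope extraction to one part of the page
        'sku': response.css('.product-info span.sku::text').get(),
    }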

Saving Data

By default, Scrapy only logs the scraped items to the console. To save them to a file, you can run the spider with:

scrapy crawl myspider -O products.json

This saves the scraped data to products.json (capital -O overwrites the file; lowercase -o appends to it). Scrapy supports JSON, JSON Lines, CSV, and XML formats out of the box.
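
If you'd rather configure the output in code than on the command line, recent Scrapy versions (2.1+) also support the FEEDS setting. A sketch using the spider's custom_settings (the overwrite option needs Scrapy 2.4 or later):

class MySpider(SitemapSpider):
    # ... name, sitemap_urls, and parse methods as above ...
    custom_settings = {
        'FEEDS': {
            'products.json': {'format': 'json', 'overwrite': True},
        },
    }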

For more advanced use cases, you can write a custom Item Pipeline to process and store the scraped data, e.g. in a database.
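
As an illustration, a pipeline that writes each item to SQLite might look roughly like this sketch (the table schema and module path are assumptions); you enable it by adding the class to ITEM_PIPELINES in settings.py:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # open the database once when the spider starts
        self.conn = sqlite3.connect('products.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT)'
        )

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.conn.execute(
            'INSERT INTO products VALUES (?, ?)',
            (item.get('title'), item.get('price')),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

# in settings.py (the module path here is hypothetical):
# ITEM_PIPELINES = {'myproject.pipelines.SQLitePipeline': 300}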

Summary

Crawling a website sitemap is an efficient way to find all the pages you need to scrape. The Scrapy framework makes this easy by:

  • Providing a built-in SitemapSpider to handle the crawling

  • Supporting filtering of sitemap URLs

  • Allowing customization of the data extraction and storage

With just a few lines of code, you can build a complete web scraping spider to extract data from an entire website. Scrapy takes care of the heavy lifting, allowing you to focus on parsing and saving the data you need.

Let's get scraping 🚀
