Extracting Data with Cheerio

Dec 17, 2023

Cheerio is a fast, lightweight library for parsing and manipulating HTML in Node.js, which makes it a good fit for web scraping. Its syntax closely follows jQuery, so you can extract data from a page's markup with only basic JavaScript knowledge.

In this article, we'll explore how to use Cheerio for web scraping and cover the following key points:

  1. Setting up the environment and installing Cheerio

  2. Loading HTML content into Cheerio

  3. Selecting and extracting data using CSS selectors

  4. Handling pagination and scraping multiple pages

  5. Saving the extracted data to a file

Setting up the Environment

To get started with Cheerio, you need to have Node.js installed on your machine. Once you have Node.js set up, create a new project directory and initialize a new Node.js project using the following commands:

mkdir cheerio-scraper
cd cheerio-scraper
npm init -y

Next, install the necessary dependencies, including Cheerio and Axios (for making HTTP requests):

npm install cheerio axios

Loading HTML Content

To start scraping with Cheerio, you need to load the HTML content of the webpage you want to scrape. You can do this by making an HTTP request to the target URL using Axios. Here's an example:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://example.com';

axios.get(url)
  .then(response => {
    // response.data holds the raw HTML returned by the server
    const html = response.data;
    const $ = cheerio.load(html);

    // Scraping logic goes here
  })
  .catch(error => {
    // Network failures and non-2xx responses both end up here
    console.error('Error:', error.message);
  });

In this code snippet, we use Axios to send a GET request to the specified URL. Once the response is received, we load the HTML content into Cheerio using cheerio.load(). The loaded Cheerio instance is assigned to the $ variable, which we'll use to select and extract data.

Selecting and Extracting Data

Cheerio provides a powerful set of methods for selecting and extracting data from the loaded HTML. It uses CSS selectors to target specific elements on the page. Here are a few examples:

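// Assumes $ was created with cheerio.load(html), as in the previous section

// Select all <h2> elements
const headings = $('h2');

// Select elements with a specific class
const items = $('.item');

// Select links whose href attribute starts with "https://"
const links = $('a[href^="https://"]');

// Extract text content (text() joins the text of every matched <p> into one string)
const text = $('p').text();

// Extract the href attribute of every link as a plain array
const urls = $('a').map((index, element) => $(element).attr('href')).get();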

In these examples, we use various CSS selectors to select elements based on their tag name, class, or attributes. We can then extract the desired data, such as text content or attribute values, using methods like text() and attr().

Handling Pagination

Many websites have content spread across multiple pages. To scrape data from all pages, you need to handle pagination. Here's an example of how you can scrape data from multiple pages using Cheerio:

const baseUrl = 'https://example.com/page/';
const totalPages = 5;

for (let page = 1; page <= totalPages; page++) {
  const url = `${baseUrl}${page}`;

  axios.get(url)
    .then(response => {
      const html = response.data;
      const $ = cheerio.load(html);

      // Scraping logic for each page
      // ...
    })
    .catch(error => {
      console.error(`Error fetching ${url}:`, error.message);
    });
}

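In this example, we assume that the website uses a URL pattern like https://example.com/page/1, https://example.com/page/2, and so on. We iterate over the desired range of pages, make a request to each page URL, and perform the scraping logic for each page. Because the loop does not wait for a response before sending the next request, the pages are fetched concurrently and may finish in any order.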
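To fetch pages one at a time with a pause between requests, one option is an async function that awaits each response. This is a sketch, not the only approach. The helpers sleep, buildPageUrls, and scrapePagesInOrder are hypothetical names, fetchHtml and handlePage are functions you supply, and the 1000 ms default delay is an arbitrary assumption.

```javascript
// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Build the page URLs for a pattern like https://example.com/page/1
function buildPageUrls(baseUrl, totalPages) {
  return Array.from({ length: totalPages }, (_, i) => `${baseUrl}${i + 1}`);
}

// fetchHtml: (url) => Promise<string>, for example
//   (url) => axios.get(url).then((res) => res.data)
// handlePage: (html, url) => any, for example a function that calls cheerio.load(html)
async function scrapePagesInOrder(urls, fetchHtml, handlePage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    try {
      const html = await fetchHtml(urls[i]);
      results.push(handlePage(html, urls[i]));
    } catch (error) {
      // Log and continue so one failed page does not stop the run
      console.error(`Error fetching ${urls[i]}:`, error.message);
    }
    // Wait before the next request, but not after the last one
    if (i < urls.length - 1) await sleep(delayMs);
  }
  return results;
}
```

Passing the fetcher in as an argument keeps the loop independent of Axios, so it can be exercised with a stub before it is run against a real site.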

Saving the Extracted Data

Once you have extracted the desired data using Cheerio, you may want to save it to a file for further analysis or processing. Here's an example of how you can save the scraped data to a JSON file:

const fs = require('fs');

// Scraping logic
// ...

const scrapedData = [
  { name: 'Item 1', price: 10 },
  { name: 'Item 2', price: 20 },
  // ...
];

// Serialize with two-space indentation and write asynchronously
fs.writeFile('data.json', JSON.stringify(scrapedData, null, 2), err => {
  if (err) {
    console.error('Error writing file:', err);
  } else {
    console.log('Data saved to data.json');
  }
});

In this code snippet, we assume that you have scraped some data and stored it in the scrapedData array. We use the fs module to write the data to a file named data.json. JSON.stringify() converts the array to a JSON string; its third argument, 2, indents the output by two spaces for readability.
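For a short script, a synchronous write followed by reading the file back is a quick way to confirm that the data survived the round trip. The sketch below assumes that is acceptable for your use case. The helpers saveAsJson and loadJson are hypothetical names, and the file goes to the operating system's temporary directory only to keep the example from cluttering a project folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical records standing in for real scraped data
const scrapedData = [
  { name: 'Item 1', price: 10 },
  { name: 'Item 2', price: 20 },
];

// Write the records as pretty-printed JSON and return the path written to
function saveAsJson(filePath, data) {
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8');
  return filePath;
}

// Read a JSON file back and parse it
function loadJson(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

const outputPath = saveAsJson(path.join(os.tmpdir(), 'data.json'), scrapedData);
const restored = loadJson(outputPath);
console.log(`Saved ${restored.length} records to ${outputPath}`);
```

fs.writeFileSync throws on failure instead of passing an error to a callback, so wrap the call in try/catch if the script should keep running after a failed write.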

Conclusion

Cheerio is a powerful and easy-to-use library for web scraping in Node.js. It allows you to extract data from websites using familiar CSS selectors and provides a wide range of methods for manipulating and traversing the HTML structure.

In this article, we covered the basics of setting up Cheerio, loading HTML content, selecting and extracting data, handling pagination, and saving the scraped data to a file. With these techniques, you can build robust web scrapers to gather data from various websites efficiently.

Remember to respect the terms of service and robots.txt file of the websites you scrape, and be mindful of the scraping frequency to avoid overwhelming the server.

Happy scraping with Cheerio!
