Parsing HTML with Cheerio

Nov 29, 2023

Cheerio is a fast and lightweight library that allows you to parse and manipulate HTML documents using a jQuery-like syntax in Node.js. It provides a convenient way to extract data from web pages by traversing the DOM and selecting elements using familiar CSS selectors.

In this article, we'll cover the key concepts of using Cheerio to parse HTML, including:

  • Installing and setting up Cheerio

  • Loading HTML into Cheerio

  • Selecting elements and extracting data

  • Manipulating the parsed HTML

  • Rendering the modified HTML

Installing Cheerio

To get started with Cheerio, first install it in your Node.js project using npm:

npm install cheerio

Loading HTML

Cheerio works by parsing HTML content into an in-memory document tree that you can then query and manipulate. Unlike a browser, it does not execute scripts or render the page. There are a few ways to load HTML into Cheerio:

  1. Load HTML from a string:

const cheerio = require('cheerio');

const $ = cheerio.load('<h1>Hello, Cheerio!</h1>');

  2. Load HTML from a file:

const fs = require('fs');

const cheerio = require('cheerio');

// Pass an encoding so readFileSync returns a string rather than a Buffer
const html = fs.readFileSync('index.html', 'utf8');

const $ = cheerio.load(html);

  3. Load HTML from a URL using an HTTP client like axios or got:

const axios = require('axios');

const cheerio = require('cheerio');

const getHtml = async () => {

  const { data } = await axios.get('https://example.com');

  return cheerio.load(data);

};

The cheerio.load() function returns a query function bound to the parsed document. By convention, it's assigned to the $ variable to mimic jQuery syntax: calling $(selector) returns a Cheerio object wrapping the matched elements.

Selecting Elements

Cheerio uses the same CSS selector syntax as jQuery to select elements from the virtual DOM. For example:

$('h1') // select all <h1> elements

$('#main') // select element with id "main"

$('.featured') // select all elements with class "featured"

$('article > p') // select <p> direct children of <article>

$('ul li') // select <li> descendants of <ul>

You can also chain selectors and use pseudo-selectors like :first, :last, :eq(index), :contains(text), etc.

Extracting Data

Once you've selected elements, Cheerio provides methods to extract data from them:

// Get the combined text content of all matched elements

const text = $('h1').text();

// Get an attribute value (read from the first matched element)

const src = $('img').attr('src');

// Get the inner HTML (of the first matched element)

const html = $('div').html();

// Loop through a collection

$('li').each((i, el) => {

  console.log($(el).text());

});

// Get the combined text of children

const combinedText = $('div').children().text();

Manipulating Elements

Cheerio allows you to modify the parsed HTML using methods like:

// Append content

$('ul').append('<li>New item</li>');

// Prepend content

$('ul').prepend('<li>First item</li>');

// Remove elements

$('li').remove('.promo');

// Add/remove/toggle classes

$('h1').addClass('highlight');

$('h2').removeClass('hidden');

$('button').toggleClass('active');

// Modify attributes

$('a').attr('href', 'https://example.com');

These operations modify only the in-memory document. The original HTML string or file you loaded is left unchanged.

Rendering HTML

After manipulating the parsed HTML, you can render it back to an HTML string using the html() method:

$.html(); // returns the outer HTML of the entire document

$('h1').html(); // returns the inner HTML of the first <h1>

Summary

Cheerio is a powerful library for parsing and manipulating HTML in Node.js using familiar jQuery syntax. With Cheerio you can:

  • Load HTML from strings, files, or URLs

  • Select elements using CSS selectors

  • Extract data like text, attributes, and HTML

  • Manipulate elements by adding, removing, or modifying their content and attributes

  • Render the parsed HTML back to a string

Cheerio makes it easy to scrape websites and process HTML documents in your Node.js applications.
