Parsing HTML with Cheerio
Nov 29, 2023
Cheerio is a fast and lightweight library that allows you to parse and manipulate HTML documents using a jQuery-like syntax in Node.js. It provides a convenient way to extract data from web pages by traversing the DOM and selecting elements using familiar CSS selectors.
In this article, we'll cover the key concepts of using Cheerio to parse HTML, including:
Installing and setting up Cheerio
Loading HTML into Cheerio
Selecting elements and extracting data
Manipulating the parsed HTML
Rendering the modified HTML
Installing Cheerio
To get started with Cheerio, first install it in your Node.js project using npm:
npm install cheerio
Loading HTML
Cheerio works by loading HTML content into a virtual DOM that you can then query and manipulate. There are a few ways to load HTML into Cheerio:
Load HTML from a string:
const cheerio = require('cheerio');
const $ = cheerio.load('<h1>Hello, Cheerio!</h1>');
Load HTML from a file:
const fs = require('fs');
const cheerio = require('cheerio');
const html = fs.readFileSync('index.html', 'utf8'); // read the file as a string rather than a Buffer
const $ = cheerio.load(html);
Load HTML from a URL using an HTTP client like axios or got:
const axios = require('axios');
const cheerio = require('cheerio');
const getHtml = async () => {
const { data } = await axios.get('https://example.com');
return cheerio.load(data);
};
The cheerio.load() function parses the HTML and returns a query function bound to the resulting document. By convention, it's assigned to the $ variable to mimic jQuery syntax.
Selecting Elements
Cheerio uses the same CSS selector syntax as jQuery to select elements from the virtual DOM. For example:
$('h1') // select all <h1> elements
$('#main') // select element with id "main"
$('.featured') // select all elements with class "featured"
$('article > p') // select <p> direct children of <article>
$('ul li') // select <li> descendants of <ul>
You can also chain selectors and use pseudo-selectors like :first, :last, :eq(index), :contains(text), etc.
Extracting Data
Once you've selected elements, Cheerio provides methods to extract data from them:
// Get the combined text content of all matched elements
const text = $('h1').text();
// Get an attribute value from the first matched element
const src = $('img').attr('src');
// Get the inner HTML of the first matched element
const html = $('div').html();
// Loop through a collection
$('li').each((i, el) => {
console.log($(el).text());
});
// Get the combined text of children
const combinedText = $('div').children().text();
Manipulating Elements
Cheerio allows you to modify the parsed HTML using methods like:
// Append content
$('ul').append('<li>New item</li>');
// Prepend content
$('ul').prepend('<li>First item</li>');
// Remove elements
$('li').remove('.promo');
// Add/remove/toggle classes
$('h1').addClass('highlight');
$('h2').removeClass('hidden');
$('button').toggleClass('active');
// Modify attributes
$('a').attr('href', 'https://example.com');
These operations modify the virtual DOM without affecting the original HTML source.
Rendering HTML
After manipulating the parsed HTML, you can render it back to an HTML string using the html() method:
$.html(); // returns the outer HTML of the entire document
$('h1').html(); // returns the inner HTML of the first <h1>
Summary
Cheerio is a powerful library for parsing and manipulating HTML in Node.js using familiar jQuery syntax. With Cheerio you can:
Load HTML from strings, files, or URLs
Select elements using CSS selectors
Extract data like text, attributes, and HTML
Manipulate elements by adding, removing, or modifying their content and attributes
Render the parsed HTML back to a string
Cheerio makes it easy to scrape websites and process HTML documents in your Node.js applications. I hope this article has helped you understand the key concepts and how to get started using Cheerio.