Scraping Websites with Complex Navigation
Nov 1, 2023
In this article, we will explore how to scrape websites with complex navigation using headless browsers. Modern websites in 2023 heavily rely on JavaScript frameworks like React, Angular, and Vue.js to render interactive data, which can make web scraping challenging. We'll cover how browser automation renders dynamic web pages, the most popular tools available, and common challenges and tips.
How Browser Automation Works
Headless browsers like Chrome and Firefox come with built-in automation protocols that allow other programs to control them:
The older WebDriver protocol, implemented through an extra intermediary layer (the WebDriver executable) that translates commands to the browser.
The newer Chrome DevTools Protocol (CDP), where the control layer is built directly into most modern browsers.
We'll focus on CDP in this article, but the developer experience is similar for both protocols.
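To give a sense of what CDP looks like under the hood: the browser and the controlling program exchange JSON messages over a WebSocket, where each command carries an id, a method name, and parameters. Here's an illustrative sketch of building such a command (the method name `Page.navigate` is part of the real protocol; the helper function is our own):

```python
import json

def cdp_command(command_id: int, method: str, **params) -> str:
    """Serialize a Chrome DevTools Protocol command as a JSON message."""
    return json.dumps({"id": command_id, "method": method, "params": params})

# a CDP message instructing the browser tab to navigate to a URL
message = cdp_command(1, "Page.navigate", url="https://www.airbnb.com/experiences")
print(message)
```

The automation libraries below hide this message plumbing behind friendlier APIs like `page.goto()`.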
Example: Scraping Airbnb Experiences
Let's consider a real-world example of scraping online experience data from https://www.airbnb.com/experiences. Our task is to fully render a single experience page and return the rendered contents for further processing.
We'll implement this using four popular browser automation tools:
Selenium
Puppeteer
Playwright
ScrapFly API
Selenium
Selenium is one of the oldest browser automation tools, supporting both WebDriver and CDP protocols. It has a large community and supports many programming languages.
Here's how to scrape the Airbnb page using Selenium with Python:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located

browser = webdriver.Chrome()
browser.get("https://www.airbnb.com/experiences/272085")

# wait up to 10 seconds for the title to be rendered by JavaScript
title = (
    WebDriverWait(driver=browser, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
content = browser.page_source
browser.close()

# parse the fully rendered HTML with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print(soup.find("h1").text)
Puppeteer
Puppeteer is an asynchronous browser automation library for JavaScript (with unofficial ports to other languages such as Python). It fully implements the CDP protocol.
Here's the same example using Puppeteer with JavaScript:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://airbnb.com/experiences/272085');
  // wait for the dynamic content to render
  await page.waitForSelector('h1');
  // capture the fully rendered HTML
  const content = await page.content();
  await browser.close();
})();
Playwright
Playwright is a browser automation library available in multiple languages, maintained by Microsoft. It supports both synchronous and asynchronous clients.
Here's the example using Playwright with Python:
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://airbnb.com/experiences/272085')
        # wait for the dynamic content to render
        await page.wait_for_selector('h1')
        content = await page.content()
        await browser.close()
        return content

asyncio.run(run())
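The asynchronous client pays off when rendering many pages: instead of waiting on each page in turn, we can schedule them all concurrently with asyncio.gather. Here's a minimal sketch of the pattern, where `scrape_page` is a hypothetical stand-in for the launch/goto/content sequence above (a short sleep simulates network and render time):

```python
import asyncio

async def scrape_page(url: str) -> str:
    """Hypothetical stand-in for navigating a browser page and
    returning its rendered HTML (e.g. via Playwright's async API)."""
    await asyncio.sleep(0.1)  # simulates network + render time
    return f"<html>rendered {url}</html>"

async def scrape_all(urls):
    # schedule all page scrapes concurrently; total wall time is roughly
    # the slowest single page rather than the sum of all pages
    return await asyncio.gather(*(scrape_page(url) for url in urls))

urls = [f"https://airbnb.com/experiences/{i}" for i in (272085, 272086, 272087)]
results = asyncio.run(scrape_all(urls))
print(len(results))
```

In a real scraper you would cap concurrency (e.g. with an asyncio.Semaphore) to avoid exhausting memory or triggering rate limits.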
ScrapFly API
ScrapFly API provides a cloud-based browser automation solution that handles page rendering, session management, and more.
Here's how to use ScrapFly's Python SDK:
import asyncio
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

async def run():
    scrapfly = ScrapflyClient(key="YOURKEY", max_concurrency=2)
    to_scrape = [
        ScrapeConfig(
            url="https://www.airbnb.com/experiences/272085",
            render_js=True,  # render the page in a cloud browser
            wait_for_selector="h1",  # wait for dynamic content
        ),
    ]
    results = await scrapfly.concurrent_scrape(to_scrape)
    print(results[0].content)

asyncio.run(run())
Challenges and Tips
When scraping with browser automation, there are additional challenges to consider:
Avoiding being blocked by fingerprinting
Session persistence
Proxy integration
Scaling and resource optimization
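Session persistence, for instance, often amounts to saving the browser's cookies between runs and restoring them into a fresh browser. A minimal sketch, assuming cookies in the list-of-dicts format that Selenium's get_cookies() returns:

```python
import json

def save_cookies(cookies, path):
    """Persist a browser session's cookies (as returned by e.g.
    driver.get_cookies() in Selenium) to a JSON file."""
    with open(path, "w") as f:
        json.dump(cookies, f)

def load_cookies(path):
    """Reload cookies so a new browser session can resume where the
    previous one left off (e.g. via driver.add_cookie() per cookie)."""
    with open(path) as f:
        return json.load(f)
```

Note that cookies carry domains and expiry times, so restoring them works only for the same site and while the session is still valid server-side.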
To mitigate these issues:
Use stealth extensions or patches to hide common browser fingerprints
Leverage asynchronous clients for better performance and concurrency
Disable unnecessary resource loading (images, styles) to speed up scraping
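Resource blocking usually comes down to a small filter on resource types, which the automation tool then applies to every outgoing request. The exact hook differs per tool; in Playwright it would be a page.route handler. Here's a sketch of the filtering logic, with the Playwright wiring shown only in comments:

```python
# resource types that are rarely needed when only the HTML matters
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}

def should_block(resource_type: str) -> bool:
    """Decide whether a request should be aborted to speed up page loads."""
    return resource_type in BLOCKED_RESOURCE_TYPES

# In Playwright's sync API this filter could be wired up roughly as:
#   page.route("**/*", lambda route: route.abort()
#       if should_block(route.request.resource_type) else route.continue_())

print(should_block("image"), should_block("document"))
```

Blocking images and styles can cut page-load time substantially, but be careful not to block scripts or XHR requests that the page needs to render its data.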
Summary
In this article, we explored how to scrape websites with complex navigation using headless browsers. We compared popular tools like Selenium, Puppeteer, Playwright, and ScrapFly API, providing code examples for each. We also discussed common challenges and tips for browser-based web scraping.
Choosing the right tool depends on your specific project requirements, but Playwright and Puppeteer have an advantage with their asynchronous clients. Alternatively, ScrapFly offers a scalable and managed solution.
By understanding the capabilities and challenges of browser automation, you can effectively scrape dynamic websites and extract valuable data.
Let's get scraping 🚀
Ready to start?
Get scraping now with a free account and $25 in free credits when you sign up.