Scraping Websites with Complex Navigation

Nov 1, 2023

In this article, we will explore how to scrape websites with complex navigation using headless browsers. Modern websites in 2023 heavily rely on JavaScript frameworks like React, Angular, and Vue.js to render interactive data, which can make web scraping challenging. We'll cover the key points of using browser automation to scrape dynamic web pages, popular tools available, and common challenges and tips.

How Browser Automation Works

Headless browsers like Chrome and Firefox come with built-in automation protocols that allow other programs to control them:

  • The older WebDriver protocol, implemented through an extra translation layer (the WebDriver executable) that relays commands to the browser.

  • The newer Chrome DevTools Protocol (CDP), where the control layer is built directly into most modern browsers, so no extra executable is needed.

We'll focus on CDP in this article, but the developer experience is similar for both protocols.
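Under the hood, CDP commands are plain JSON messages exchanged over a WebSocket. As a rough illustration of the message framing (the method name below is a real CDP method, but the `cdp_command` helper is our own and a real client would also manage the WebSocket connection and event subscriptions):

```python
import json

def cdp_command(command_id: int, method: str, params: dict) -> str:
    # CDP messages are JSON objects with an id (used to match replies
    # to requests), a domain-qualified method name, and a params object.
    return json.dumps({"id": command_id, "method": method, "params": params})

# Ask the browser to navigate the current page:
navigate = cdp_command(
    1, "Page.navigate", {"url": "https://www.airbnb.com/experiences/272085"}
)
print(navigate)
```

Libraries like Puppeteer and Playwright generate and dispatch these messages for you, which is why their APIs feel so similar.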

Example: Scraping Airbnb Experiences

Let's consider a real-world example of scraping online experience data from https://www.airbnb.com/experiences. Our task is to fully render a single experience page and return the rendered contents for further processing.

We'll implement this using four popular browser automation tools:

  1. Selenium

  2. Puppeteer

  3. Playwright

  4. ScrapFly API

Selenium

Selenium is one of the oldest browser automation tools, supporting both WebDriver and CDP protocols. It has a large community and supports many programming languages.

Here's how to scrape the Airbnb page using Selenium with Python:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://www.airbnb.com/experiences/272085")

# Wait up to 10 seconds for the first <h1> to become visible,
# which signals that the page's JavaScript has finished rendering:
title = (
    WebDriverWait(driver=browser, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
content = browser.page_source
browser.close()

# Parse the rendered HTML for further processing:
soup = BeautifulSoup(content, "html.parser")
print(soup.find("h1").text)

Puppeteer

Puppeteer is an asynchronous browser automation library for JavaScript (an unofficial Python port also exists). It fully implements the CDP protocol.

Here's the same example using Puppeteer with JavaScript:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://airbnb.com/experiences/272085');
  // Wait for the first <h1> to appear, signaling the page has rendered:
  await page.waitForSelector('h1');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

Playwright

Playwright is a browser automation library available in multiple languages, maintained by Microsoft. It supports both synchronous and asynchronous clients.

Here's the example using Playwright with Python:

import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://airbnb.com/experiences/272085')
        # Wait for the first <h1> to appear, signaling the page has rendered:
        await page.wait_for_selector('h1')
        return await page.content()

content = asyncio.run(run())
print(content)

ScrapFly API

ScrapFly API provides a cloud-based browser automation solution that handles page rendering, session management, and more.

Here's how to use ScrapFly's Python SDK:

import asyncio
from scrapfly import ScrapeConfig, ScrapflyClient

async def run():
    scrapfly = ScrapflyClient(key="YOURKEY", max_concurrency=2)
    to_scrape = [
        ScrapeConfig(
            url="https://www.airbnb.com/experiences/272085",
            # Render the page in a cloud headless browser and
            # wait for the <h1> before returning:
            render_js=True,
            wait_for_selector="h1",
        ),
    ]
    results = await scrapfly.concurrent_scrape(to_scrape)
    print(results[0].content)

asyncio.run(run())

Challenges and Tips

When scraping with browser automation, there are additional challenges to consider:

  • Avoiding being blocked by fingerprinting

  • Session persistence

  • Proxy integration

  • Scaling and resource optimization

To mitigate these issues:

  • Use stealth extensions or patches to hide common browser fingerprints

  • Leverage asynchronous clients for better performance and concurrency

  • Disable unnecessary resource loading (images, styles) to speed up scraping
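For example, skipping images, stylesheets, and fonts can cut page load times substantially. A minimal sketch of such a filter (the `should_block` helper and the blocked-type set are our own choices; with Playwright you would wire it into `page.route` as shown in the comment):

```python
# Resource types that are rarely needed when only the HTML matters:
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}

def should_block(resource_type: str) -> bool:
    # Decide whether a request should be aborted rather than fetched.
    return resource_type in BLOCKED_RESOURCE_TYPES

# With Playwright's async API this would plug in roughly as:
#   async def handler(route):
#       if should_block(route.request.resource_type):
#           await route.abort()
#       else:
#           await route.continue_()
#   await page.route("**/*", handler)

print(should_block("image"))     # image requests get aborted
print(should_block("document"))  # the HTML document itself still loads
```

The same idea applies to Puppeteer (`page.setRequestInterception`) and to CDP's `Network` domain; only the wiring differs.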

Summary

In this article, we explored how to scrape websites with complex navigation using headless browsers. We compared popular tools like Selenium, Puppeteer, Playwright, and ScrapFly API, providing code examples for each. We also discussed common challenges and tips for browser-based web scraping.

Choosing the right tool depends on your specific project requirements, but Playwright and Puppeteer have an advantage with their asynchronous clients. Alternatively, ScrapFly offers a scalable and managed solution.

By understanding the capabilities and challenges of browser automation, you can effectively scrape dynamic websites and extract valuable data.

Let's get scraping 🚀
