Handling CAPTCHAs and Other Anti-Scraping Measures

Jun 27, 2023

When web scraping, you may encounter a variety of anti-scraping techniques that websites use to keep bots away from their data. These measures protect sites from the excessive requests that can slow them down or crash them. In this article, we'll cover how to handle CAPTCHAs and other common anti-scraping techniques so that your web scraping efforts succeed.

IP Blocking

One of the most basic anti-scraping measures is IP blocking. Websites can track the IP address making requests and block it if an unusually high number of requests is detected in a short period of time.

To avoid IP blocking:

  • Slow down your scraping speed by adding delays between requests

  • Use a random delay to avoid making requests at a consistent interval

  • Rotate your IP address periodically by using proxies (see the sketch after the delay example below)

Here's an example of adding a random delay in Python:

import random
import time

import requests

url = 'https://example.com/data'  # the page you want to scrape

# Make a request
response = requests.get(url)

# Sleep for a random duration between 1 and 5 seconds
# so requests don't arrive at a fixed interval
time.sleep(random.uniform(1, 5))
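For rotating IPs, the requests library accepts a per-request proxies mapping, which makes rotation straightforward. Here's a minimal sketch, assuming you already have a pool of proxy endpoints (the addresses below are placeholders):

import random

import requests

# Placeholder proxy pool; substitute the endpoints from your proxy provider
proxy_pool = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
]

# Route each request through a randomly chosen proxy
proxy = random.choice(proxy_pool)
response = requests.get(
    'https://example.com/data',
    proxies={'http': proxy, 'https': proxy},
)

In practice you'd also drop endpoints from the pool when they start failing, so dead proxies don't waste requests.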

CAPTCHAs

CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are challenge-response tests used to determine if the user is human. They often involve identifying distorted text or images.

Solving CAPTCHAs programmatically can be difficult and time-consuming. The best approach is to try to avoid triggering them in the first place by:

  • Slowing down your request rate

  • Randomizing your scraping patterns

  • Using a headless browser to better mimic human behavior
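On the last point, here's a minimal sketch of launching Chrome in headless mode with Selenium; the --headless flag is Chrome-specific, and other drivers have their own equivalents:

from selenium import webdriver

# Run Chrome without a visible window while still executing
# JavaScript and handling cookies like a real browser
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com')
html = driver.page_source
driver.quit()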

If you do encounter a CAPTCHA, there are services that provide APIs to solve them, such as 2captcha and Death by Captcha. Here's an example using the 2captcha API in Python:

import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('your_api_key')

# Download the CAPTCHA image from the page
captcha_url = 'https://example.com/captcha.jpg'
with open('captcha.jpg', 'wb') as f:
    f.write(requests.get(captcha_url).content)

# Solve the CAPTCHA (normal() accepts a path to the image file)
result = solver.normal('captcha.jpg')

# Use the solved CAPTCHA text to submit the form
form_data = {'captcha': result['code']}
response = requests.post('https://example.com/submit', data=form_data)

User Agent Validation

Websites may check the User-Agent header to determine if the request is coming from a real browser. Make sure to set this header to mimic a common browser.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}
response = requests.get('https://example.com', headers=headers)
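Some scrapers go a step further and rotate through several common User-Agent strings so that requests from one address don't all look identical. A simple sketch, using a few ordinary desktop browser strings:

import random

import requests

# A small pool of common desktop browser User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0',
]

# Send a randomly chosen User-Agent with each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://example.com', headers=headers)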

Handling Login Forms

Some websites require logging in to access certain pages. You'll need to simulate the login process by submitting the login form with valid credentials.

import requests
from bs4 import BeautifulSoup

login_url = 'https://example.com/login'
creds = {'username': 'your_username', 'password': 'your_password'}

# Use a Session so cookies persist across requests
session = requests.Session()

# Get the login page and extract the CSRF token if present
# (the hidden field's name varies by site; inspect the form's HTML)
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.text, 'html.parser')
token_field = soup.find('input', {'name': 'csrf_token'})
if token_field:
    creds['csrf_token'] = token_field['value']

# Post the login data
response = session.post(login_url, data=creds)

# Logged in; the same session can now make authenticated requests
response = session.get('https://example.com/private-page')
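One caveat: a 200 status code on the login POST doesn't guarantee that authentication succeeded. Before scraping protected pages, check the response for something that only appears when logged in, such as a logout link or the account username.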

Handling Dynamic Content (AJAX)

Some websites load content dynamically with JavaScript after the initial page load. To scrape this type of content, you'll need a browser automation tool like Puppeteer or Selenium to fully render the page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for the dynamic content to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-content"))
)
dynamic_content = element.text

driver.quit()

Summary

Handling anti-scraping measures is an essential part of effective web scraping. By understanding techniques like IP rotation, CAPTCHA solving, User-Agent spoofing, handling logins, and rendering dynamic content, you can scrape websites more reliably. Always respect website terms of service and robots.txt to ensure you are scraping ethically. With the right approach, you can gather the web data you need while avoiding detection and bans.

Let's get scraping 🚀
