Handling CAPTCHAs and Other Anti-Scraping Measures
Jun 27, 2023
When web scraping, you may encounter various anti-scraping techniques that websites use to keep bots away from their data. These measures protect the site from slowing down or crashing under excessive requests, and from automated harvesting of its content. In this article, we'll cover how to handle CAPTCHAs and other common anti-scraping techniques so your web scraping efforts succeed.
IP Blocking
One of the most basic anti-scraping measures is IP blocking. Websites can track the IP address making requests and block it if an unusually high number of requests is detected in a short period of time.
To avoid IP blocking:
Slow down your scraping speed by adding delays between requests
Use a random delay to avoid making requests at a consistent interval
Rotate your IP address periodically by using proxies (see the sketch after the delay example below)
Here's an example of adding a random delay in Python:
import random
import time

import requests

url = 'https://example.com/data'

# Make a request
response = requests.get(url)

# Sleep for a random duration between 1 and 5 seconds before the next request
time.sleep(random.uniform(1, 5))
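To rotate IPs, you can route each request through a different proxy. Here's a minimal sketch using the proxies parameter of requests; the proxy URLs are placeholders for your own pool:
import random

import requests

url = 'https://example.com/data'

# Placeholder proxy pool; substitute proxies you control or rent
proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

# Pick a random proxy for each request
proxy = random.choice(proxy_pool)
response = requests.get(url, proxies={'http': proxy, 'https': proxy})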
CAPTCHAs
CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are challenge-response tests used to determine if the user is human. They often involve identifying distorted text or images.
Solving CAPTCHAs programmatically can be difficult and time-consuming. The best approach is to try to avoid triggering them in the first place by:
Slowing down your request rate
Randomizing your scraping patterns
Using a headless browser to better mimic human behavior
If you do encounter a CAPTCHA, there are services such as 2Captcha and Death by Captcha that provide APIs to solve them. Here's an example using the 2captcha Python library:
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('your_api_key')

# URL of the form the solved CAPTCHA will be submitted to
url = 'https://example.com/submit'

# Get the CAPTCHA image URL from the page
captcha_url = 'https://example.com/captcha.jpg'

# Solve the CAPTCHA image
result = solver.normal(captcha_url)

# Use the solved CAPTCHA text to submit the form
form_data = {'captcha': result['code']}
response = requests.post(url, data=form_data)
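Keep in mind that these services charge a small fee per solved CAPTCHA and usually take anywhere from a few seconds to a minute to return an answer, so avoiding CAPTCHAs remains cheaper and faster than solving them.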
User Agent Validation
Websites may check the User-Agent header to determine if the request is coming from a real browser. Make sure to set this header to mimic a common browser.
import requests

url = 'https://example.com/data'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}
response = requests.get(url, headers=headers)
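A single User-Agent string making thousands of requests is itself a signal, so it can also help to rotate through a pool of common browser strings. A minimal sketch; the strings below are examples and should be kept reasonably current:
import random

import requests

url = 'https://example.com/data'

# Example pool of common browser User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
]

# Use a different User-Agent for each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)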
Handling Login Forms
Some websites require logging in to access certain pages. You'll need to simulate the login process by submitting the login form with valid credentials. The sketch below uses BeautifulSoup to pull a CSRF token out of the login page; the name of the hidden token field ('csrf_token' here) varies by site.
import requests
from bs4 import BeautifulSoup

login_url = 'https://example.com/login'
creds = {'username': 'your_username', 'password': 'your_password'}

# Get the login page and extract the CSRF token if present
session = requests.Session()
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.text, 'html.parser')
token_field = soup.find('input', {'name': 'csrf_token'})
if token_field:
    # Add the CSRF token to the login data
    creds['csrf_token'] = token_field.get('value', '')

# Post the login data
response = session.post(login_url, data=creds)

# Logged in; the session can now make authenticated requests
response = session.get('https://example.com/private-page')
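The Session object stores the cookies set by the login response and sends them with every subsequent request, which is what keeps you authenticated across the session.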
Handling Dynamic Content (AJAX)
Some websites load content dynamically with JavaScript after the initial page load. To scrape this type of content, you'll need a browser automation tool like Puppeteer or Selenium to fully render the page. Here's a Selenium example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Run Chrome in headless mode (no visible browser window)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')

# Wait up to 10 seconds for the dynamic content to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'dynamic-content'))
)
dynamic_content = element.text
driver.quit()
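Note that a full browser is far heavier than a plain HTTP request, both in CPU and memory, so reserve headless browsers for pages that genuinely require JavaScript rendering and use requests for everything else.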
Summary
Handling anti-scraping measures is an essential part of effective web scraping. By understanding techniques like IP rotation, CAPTCHA solving, User-Agent spoofing, handling logins, and rendering dynamic content, you can scrape websites more reliably. Always respect website terms of service and robots.txt to ensure you are scraping ethically. With the right approach, you can gather the web data you need while avoiding detection and bans.
Let's get scraping 🚀