Rotating IP Addresses and User Agents
Mar 20, 2023
Rotating IP addresses and user agents are essential techniques for web scraping and crawling: they help you avoid detection and blocking by websites. By changing the IP address and user agent string from one request to the next, you make your scraping traffic look more like regular user traffic than automated bot behavior. This article covers the key concepts of IP rotation and user agent rotation and provides code examples for implementing them in Python.
What is IP Rotation?
IP rotation involves changing the IP address used for making requests to a website. This is typically done with a pool of proxy servers, each with its own unique IP address. With each new request, a different proxy (and thus a different IP address) is selected from the pool.
There are two main types of proxies that can be used for IP rotation:
Datacenter proxies - These are IP addresses that come from cloud hosting providers and datacenters. They tend to be fast and reliable, but are more easily detected as proxies.
Residential proxies - These are IP addresses that ISPs assign to real consumer devices in homes. Their traffic looks like genuine user traffic, but they can be slower and less reliable.
The key benefits of IP rotation are:
Avoiding IP-based rate limiting and blocking
Distributing high request loads across many IPs
Simulating real user traffic from diverse geolocations
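To make this concrete, here is a minimal sketch of per-request IP rotation using the requests library. The proxy endpoints are placeholders; substitute the ones from your own proxy provider:
import random
import requests

# Placeholder proxy pool -- replace with your provider's endpoints.
PROXY_POOL = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8031',
]

def fetch(url):
    # Select a different proxy (and thus IP address) for each request.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

response = fetch('https://example.com')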
What is User Agent Rotation?
A user agent is a string that identifies the client application making the HTTP request, typically containing details like the operating system, web browser, and version. Websites can use the user agent to detect bots, especially if the same user agent is used for many requests.
User agent rotation means varying the user agent string with each request to make them appear to come from different browsers and devices. A pool of user agent strings is maintained, often selected to match the target website's typical visitor profiles.
The key benefits of user agent rotation are:
Avoiding user agent based bot detection
Simulating diverse device types (mobile, desktop)
Fetching the content variations that sites serve to different browsers
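To see why this matters, note that the requests library identifies itself as python-requests/<version> by default, which is an immediate bot signal. You can check what a server actually sees using the public echo service httpbin.org:
import requests

# httpbin.org/headers echoes back the headers the server received.
resp = requests.get('https://httpbin.org/headers')
print(resp.json()['headers']['User-Agent'])
# Prints something like 'python-requests/2.31.0' -- a clear bot signature.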
Implementing IP and User Agent Rotation in Python
Let's look at some code examples of how to implement IP and user agent rotation in Python web scraping.
IP Rotation with Scrapy
The popular Scrapy web crawling framework supports IP rotation via the scrapy-rotating-proxies library. First, install it:
pip install scrapy-rotating-proxies
Then add the following to your Scrapy project's settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
Alternatively, you can specify a file path with one proxy per line:
ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'
The middleware will automatically select a random proxy from the list for each request.
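The library also ships a BanDetectionMiddleware that watches responses and temporarily retires proxies that appear banned. Per the project's documentation, you enable it alongside the rotating middleware:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}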
User Agent Rotation with Requests
For making HTTP requests with the requests library, you can specify a custom user agent string in the headers:
import requests
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
response = requests.get('https://example.com', headers={'User-Agent': user_agent})
To rotate user agents, simply select a random one from a predefined list before each request:
import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    # ...
]

def random_user_agent():
    # Pick a random user agent string from the pool.
    return random.choice(user_agents)

response = requests.get('https://example.com', headers={'User-Agent': random_user_agent()})
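If you would rather not curate the list by hand, the third-party fake-useragent package (pip install fake-useragent) maintains a database of real-world user agent strings and can supply a random one per request:
import requests
from fake_useragent import UserAgent

ua = UserAgent()
# ua.random returns a randomly chosen real-world user agent string.
response = requests.get('https://example.com', headers={'User-Agent': ua.random})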
Conclusion
Rotating IP addresses and user agents are powerful techniques for avoiding anti-bot measures when web scraping. By making each request appear different and distributing them across a range of IP addresses, you can reliably extract data at scale.
The key things to remember are:
Use both datacenter and residential proxies for IP diversity
Maintain a pool of user agent strings and rotate them
Rotate IPs and user agents between requests, not mid-session, so each simulated visitor stays consistent
Use proven libraries like Scrapy and requests to simplify the process
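Putting the two together, here is a minimal sketch (again with placeholder proxies) that rotates both the IP address and the user agent on every request:
import random
import requests

# Placeholder pools -- swap in your own proxies and user agent strings.
PROXIES = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8031',
]
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
]

def scrape(url):
    # Rotate both the proxy (IP) and the User-Agent on each request.
    proxy = random.choice(PROXIES)
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, proxies={'http': proxy, 'https': proxy},
                        headers=headers, timeout=10)

response = scrape('https://example.com')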
With these concepts and code examples, you're well on your way to building robust, hard-to-detect web scrapers. Happy scraping!