Web Scraping with Residential Proxies: Best Practices for AI Agents
Practical best practices for AI agents using residential proxies to scrape the web. Covers rate limiting, IP rotation, geo-targeting, error handling, and respecting robots.txt.
Residential proxies dramatically improve your AI agent’s ability to access the web, but they’re not a magic bullet. How you use them matters as much as the fact that you’re using them. Poor scraping practices will burn through residential IPs, waste money, and still get your agent blocked. This article covers the practical techniques that separate a reliable, production-grade agent from one that breaks constantly.
Respect robots.txt
Before discussing any optimization, let’s start with the baseline: check robots.txt.
Every well-behaved scraper should read and honor the target site’s robots.txt file. This file tells you which paths the site owner wants crawlers to avoid, and what crawl rate they consider acceptable.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Check if the URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Fetches and parses the robots.txt file
    return rp.can_fetch(user_agent, url)
```
Ignoring robots.txt is not only ethically questionable — it’s also a signal to anti-bot systems. Sites monitor whether visitors respect their crawl directives and factor compliance into their blocking decisions.
Rate Limiting: Slower Is Faster
The single most common mistake in web scraping is going too fast. Residential IPs help you avoid IP-based detection, but they don’t make you invisible. Sites also detect bot behavior through request patterns.
Implement Per-Domain Rate Limits
Don’t apply a single global rate limit. Different sites have different tolerances. A news site with heavy traffic might handle 2 requests per second without blinking. A small e-commerce site might flag you at 1 request every 5 seconds.
```python
import time
from collections import defaultdict

class RateLimiter:
    """Track the last request time per domain and enforce a minimum delay."""

    def __init__(self, default_delay: float = 1.0):
        self.last_request = defaultdict(float)
        self.delays = {}
        self.default_delay = default_delay

    def set_delay(self, domain: str, delay: float):
        self.delays[domain] = delay

    def wait(self, domain: str):
        delay = self.delays.get(domain, self.default_delay)
        elapsed = time.time() - self.last_request[domain]
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[domain] = time.time()

limiter = RateLimiter(default_delay=2.0)
limiter.set_delay("example.com", 3.0)        # This site is sensitive
limiter.set_delay("api.open-data.org", 0.5)  # This one is fine with faster rates
```
Add Jitter
Consistent timing between requests is a bot signal. Real humans don’t click links at perfectly regular intervals. Add random variation to your delays:
```python
import random

def human_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Return a delay that looks more human."""
    return base + random.uniform(0, jitter)
```
IP Rotation Strategy
Residential proxies give you access to a pool of IPs. How you rotate through them affects both success rate and cost.
Rotate Per Request (Default)
For stateless scraping — fetching individual pages where no login or session is needed — use a different IP for each request. This is the default behavior on most residential proxy services and distributes your footprint across many IPs.
Use Sticky Sessions for Multi-Step Flows
Some tasks require maintaining the same IP across multiple requests:
- Logging into a site and navigating authenticated pages.
- Paginating through results where the site tracks your session server-side.
- Adding items to a cart and checking out.
For these, use a sticky session (same IP for a defined window). Switching IPs mid-flow is a strong bot signal.
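Many residential proxy providers expose sticky sessions through a session ID embedded in the proxy username. The exact format varies by provider, so the `-session-` suffix below is illustrative, not a real provider's API — check your provider's docs for the actual convention. A minimal sketch:

```python
import uuid

def make_proxy_url(username: str, password: str, host: str, port: int,
                   session_id: str = "") -> str:
    """Build a proxy URL; a non-empty session_id pins requests to one exit IP.

    The '-session-<id>' username suffix is a common provider convention,
    but the exact format is provider-specific.
    """
    user = f"{username}-session-{session_id}" if session_id else username
    return f"http://{user}:{password}@{host}:{port}"

# One session ID for the whole login flow keeps the same exit IP throughout
session_id = uuid.uuid4().hex[:8]
proxy_url = make_proxy_url("user", "pass", "proxy.example.com", 8000, session_id)
```

Reuse the same `proxy_url` for every request in the flow; generate a fresh session ID only when you start a new logical session.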
Don’t Over-Rotate
If you’re making 5 requests to a site over 10 minutes, you don’t need 5 different IPs. A real user would use the same IP for a short browsing session. Over-rotation — using a new IP for every single request when the volume is low — can actually look more suspicious than using one IP consistently.
The general rule: rotate when volume is high, stick when interactions are sequential.
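That rule can be captured in a small decision helper. The 20-request threshold below is an arbitrary illustration, not a universal constant — tune it per site based on observed tolerance:

```python
def rotation_mode(requests_planned: int, needs_session: bool) -> str:
    """Pick an IP strategy: stick for sequential flows, rotate at high volume.

    The threshold of 20 requests is illustrative; tune it per target site.
    """
    if needs_session:
        return "sticky"  # Multi-step flows must keep one IP
    return "rotate" if requests_planned > 20 else "sticky"
```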
Geo-Targeting Best Practices
Geo-targeting lets you route requests through IPs in specific countries. Use it strategically:
Match the Target’s Expected Audience
If you’re scraping a German e-commerce site, use a German IP. Requests from a US residential IP to a .de domain aren’t necessarily blocked, but they may trigger additional verification or return different content.
```python
# Targeting a German residential IP for a German site
result = proxy_request(
    url="https://example.de/products",
    country="DE",
)
```
Use Geo-Targeting for Localized Data
Price monitoring, ad verification, search result tracking — all of these depend on geographic location. Specify the country that matches the data you need:
```python
# Get US pricing
us_price = proxy_request("https://store.example.com/product/123", country="US")

# Get UK pricing for the same product
uk_price = proxy_request("https://store.example.com/product/123", country="GB")
```
Be Aware of IP Availability
Not all countries have the same number of available residential IPs. Major markets (US, UK, DE, FR, BR) have deep pools. Smaller countries may have fewer IPs, so rotating at high volume in those regions might exhaust the available pool faster.
Error Handling and Retry Logic
Things go wrong. Your agent needs to handle failures gracefully without wasting proxy budget.
Categorize Errors
Not all errors are the same. Your retry strategy should depend on the type of failure:
| Status Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process response |
| 301, 302 | Redirect | Follow redirect |
| 403 | Forbidden/blocked | Retry with different IP, slow down |
| 429 | Rate limited | Back off exponentially |
| 500, 502, 503 | Server error | Retry after short delay |
| Connection timeout | Network issue | Retry once |
Implement Exponential Backoff
When you get rate-limited or blocked, don’t immediately retry. Back off:
```python
import random
import time

def scrape_with_backoff(url: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        result = proxy_request(url)
        if result["statusCode"] == 200:
            return result
        if result["statusCode"] in (403, 429):
            # Blocked or rate-limited: exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Blocked/rate-limited. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
            continue
        if result["statusCode"] >= 500:
            # Transient server error: retry after a short fixed delay
            time.sleep(1)
            continue
        # Non-retryable error (e.g. 404)
        return result
    return result  # Return last result after all retries
```
Track Success Rates Per Domain
Monitor which domains your agent successfully scrapes and which ones cause problems. This data helps you tune rate limits and identify sites that need different approaches:
```python
from collections import defaultdict

stats = defaultdict(lambda: {"success": 0, "failed": 0})

def track_result(domain: str, status_code: int):
    if 200 <= status_code < 400:
        stats[domain]["success"] += 1
    else:
        stats[domain]["failed"] += 1

def get_success_rate(domain: str) -> float:
    s = stats[domain]
    total = s["success"] + s["failed"]
    return s["success"] / total if total > 0 else 0.0
```
Headers and Fingerprinting
A residential IP is the foundation, but your request headers matter too.
Set a Realistic User-Agent
Don’t use the default `python-requests/2.28.0` or `curl/7.88.0`. Use a current browser user-agent string:
```python
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
```
Rotate User-Agents
Don’t use the same user-agent string for every request. Maintain a small pool of realistic user-agents and rotate them:
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

def get_random_ua() -> str:
    return random.choice(USER_AGENTS)
```
Cost Optimization
Residential proxy requests cost money. Here’s how to minimize waste:
Cache Responses
If your agent might request the same URL multiple times within a session, cache the response locally. There’s no reason to pay for the same data twice.
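A small in-memory cache with a time-to-live is usually enough for a single agent run. This is a minimal sketch; for multi-process agents you would swap the dict for a shared store such as Redis:

```python
import time

class ResponseCache:
    """In-memory TTL cache so repeated URLs don't cost a second proxy request."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl       # Seconds before a cached response goes stale
        self._store = {}     # url -> (timestamp, response)

    def get(self, url: str):
        entry = self._store.get(url)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # Miss or stale

    def put(self, url: str, response):
        self._store[url] = (time.time(), response)
```

Check the cache before every proxy request and store each successful response on the way out.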
Filter Before You Fetch
If your agent generates a list of URLs to visit, filter out obviously irrelevant ones before making proxy requests. Validate URLs, deduplicate, and prioritize.
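A simple pre-fetch filter might validate the scheme, require a hostname, and deduplicate URLs that differ only by a trailing slash. A sketch of that idea:

```python
from urllib.parse import urlparse

def filter_urls(urls, allowed_schemes=("http", "https")):
    """Drop malformed URLs and duplicates before spending proxy requests."""
    seen = set()
    result = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in allowed_schemes or not parsed.netloc:
            continue  # Malformed or unsupported URL
        key = url.rstrip("/")  # Treat trailing-slash variants as duplicates
        if key in seen:
            continue
        seen.add(key)
        result.append(url)
    return result
```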
Use Conditional Requests
If you’re monitoring a page for changes, use `If-Modified-Since` or `If-None-Match` headers. The server returns `304 Not Modified` if nothing changed, which is cheaper to process.
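A minimal sketch of ETag-based conditional requests: remember the `ETag` from each response and echo it back as `If-None-Match` on the next poll. (Whether this saves proxy cost depends on your provider's billing model; it always saves parsing work.)

```python
etags: dict = {}  # url -> last seen ETag

def conditional_headers(url: str) -> dict:
    """Headers for a revalidation request: send the last known ETag."""
    if url in etags:
        return {"If-None-Match": etags[url]}
    return {}

def record_etag(url: str, response_headers: dict):
    """Remember the ETag so the next poll can be conditional."""
    if "ETag" in response_headers:
        etags[url] = response_headers["ETag"]
```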
Putting It All Together
A well-built scraping layer for an AI agent combines all of these practices:
- Check `robots.txt` before scraping a new domain.
- Apply per-domain rate limits with jitter.
- Route through residential IPs with appropriate geo-targeting.
- Use realistic, rotating headers.
- Handle errors with categorized retry logic and exponential backoff.
- Cache responses and deduplicate requests.
- Monitor success rates and adjust strategy per domain.
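As one possible shape for the whole pipeline, the sketch below wires the earlier building blocks together through callables — `fetch` (the proxy request), `limiter_wait` (the rate limiter), and `allowed` (the robots.txt check). Names and the response dict shape follow the examples above; adapt them to your own client:

```python
import random
import time
from urllib.parse import urlparse

def scrape(url, fetch, limiter_wait, allowed, max_retries=3):
    """One scrape: robots check, per-domain rate limit, retry with backoff."""
    if not allowed(url):
        return {"statusCode": None, "skipped": "disallowed by robots.txt"}
    domain = urlparse(url).netloc
    for attempt in range(max_retries):
        limiter_wait(domain)       # Respect the per-domain rate limit
        result = fetch(url)
        if result["statusCode"] == 200:
            return result
        if result["statusCode"] in (403, 429):
            # Blocked or rate-limited: exponential backoff with jitter
            time.sleep((2 ** attempt) + random.uniform(0, 1))
            continue
        if result["statusCode"] >= 500:
            time.sleep(1)          # Transient server error: brief pause
            continue
        return result              # Non-retryable error (e.g. 404)
    return result                  # Last result after exhausting retries
```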
With a service like RentaTube, the proxy infrastructure side is handled for you — pay-per-request pricing at $0.001 USDC, geo-targeting by country code, and a REST API that fits cleanly into any HTTP-based workflow. The practices in this article ensure you use that infrastructure as effectively as possible.
The goal is not just to avoid blocks. It’s to build an agent that accesses the web reliably, efficiently, and responsibly — every time it runs.