
Web Scraping with Residential Proxies: Best Practices for AI Agents

Practical best practices for AI agents using residential proxies to scrape the web. Covers rate limiting, IP rotation, geo-targeting, error handling, and respecting robots.txt.

RentaTube

Residential proxies dramatically improve your AI agent’s ability to access the web, but they’re not a magic bullet. How you use them matters as much as the fact that you’re using them. Poor scraping practices will burn through residential IPs, waste money, and still get your agent blocked. This article covers the practical techniques that separate a reliable, production-grade agent from one that breaks constantly.

Respect robots.txt

Before discussing any optimization, let’s start with the baseline: check robots.txt.

Every well-behaved scraper should read and honor the target site’s robots.txt file. This file tells you which paths the site owner wants crawlers to avoid, and what crawl rate they consider acceptable.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt allows fetching this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable; treat as no restrictions
    return rp.can_fetch(user_agent, url)

Ignoring robots.txt is not only ethically questionable — it’s also a signal to anti-bot systems. Sites monitor whether visitors respect their crawl directives and factor compliance into their blocking decisions.

Rate Limiting: Slower Is Faster

The single most common mistake in web scraping is going too fast. Residential IPs help you avoid IP-based detection, but they don’t make you invisible. Sites also detect bot behavior through request patterns.

Implement Per-Domain Rate Limits

Don’t apply a single global rate limit. Different sites have different tolerances. A news site with heavy traffic might handle 2 requests per second without blinking. A small e-commerce site might flag you at 1 request every 5 seconds.

import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, default_delay: float = 1.0):
        self.last_request = defaultdict(float)
        self.delays = {}
        self.default_delay = default_delay

    def set_delay(self, domain: str, delay: float):
        self.delays[domain] = delay

    def wait(self, domain: str):
        delay = self.delays.get(domain, self.default_delay)
        elapsed = time.time() - self.last_request[domain]
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[domain] = time.time()

limiter = RateLimiter(default_delay=2.0)
limiter.set_delay("example.com", 3.0)  # This site is sensitive
limiter.set_delay("api.open-data.org", 0.5)  # This one is fine with faster rates

Add Jitter

Consistent timing between requests is a bot signal. Real humans don’t click links at perfectly regular intervals. Add random variation to your delays:

import random

def human_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Return a delay that looks more human."""
    return base + random.uniform(0, jitter)

IP Rotation Strategy

Residential proxies give you access to a pool of IPs. How you rotate through them affects both success rate and cost.

Rotate Per Request (Default)

For stateless scraping — fetching individual pages where no login or session is needed — use a different IP for each request. This is the default behavior on most residential proxy services and distributes your footprint across many IPs.

Use Sticky Sessions for Multi-Step Flows

Some tasks require maintaining the same IP across multiple requests:

  • Logging into a site and navigating authenticated pages.
  • Paginating through results where the site tracks your session server-side.
  • Adding items to a cart and checking out.

For these, use a sticky session (same IP for a defined window). Switching IPs mid-flow is a strong bot signal.
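Most residential proxy providers expose sticky sessions through a session identifier embedded in the proxy username. The exact syntax varies by provider, so the user-session-<id> convention below is an assumption; check your provider's documentation. A minimal sketch:

```python
import uuid

def make_proxy_url(username: str, password: str, host: str, port: int,
                   session_id: str = "") -> str:
    """Build a proxy URL; a session_id pins all requests to one exit IP.

    The 'user-session-<id>' username convention is illustrative; your
    provider's syntax may differ.
    """
    user = f"{username}-session-{session_id}" if session_id else username
    return f"http://{user}:{password}@{host}:{port}"

# Generate one session ID and reuse it for the whole login-and-browse flow
session_id = uuid.uuid4().hex[:12]
sticky_proxy = make_proxy_url("myuser", "mypass", "proxy.example.com", 8080,
                              session_id=session_id)
```

Generate the session ID once per flow, pass the same proxy URL to every request in that flow, and discard it when the flow ends.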

Don’t Over-Rotate

If you’re making 5 requests to a site over 10 minutes, you don’t need 5 different IPs. A real user would use the same IP for a short browsing session. Over-rotation — using a new IP for every single request when the volume is low — can actually look more suspicious than using one IP consistently.

The general rule: rotate when volume is high, stick when interactions are sequential.

Geo-Targeting Best Practices

Geo-targeting lets you route requests through IPs in specific countries. Use it strategically:

Match the Target’s Expected Audience

If you’re scraping a German e-commerce site, use a German IP. Requests from a US residential IP to a .de domain aren’t necessarily blocked, but they may trigger additional verification or return different content.

# Targeting German residential IP for a German site
result = proxy_request(
    url="https://example.de/products",
    country="DE"
)

Use Geo-Targeting for Localized Data

Price monitoring, ad verification, search result tracking — all of these depend on geographic location. Specify the country that matches the data you need:

# Get US pricing
us_price = proxy_request("https://store.example.com/product/123", country="US")

# Get UK pricing for the same product
uk_price = proxy_request("https://store.example.com/product/123", country="GB")

Be Aware of IP Availability

Not all countries have the same number of available residential IPs. Major markets (US, UK, DE, FR, BR) have deep pools. Smaller countries may have fewer IPs, so rotating at high volume in those regions might exhaust the available pool faster.

Error Handling and Retry Logic

Things go wrong. Your agent needs to handle failures gracefully without wasting proxy budget.

Categorize Errors

Not all errors are the same. Your retry strategy should depend on the type of failure:

Status Code          Meaning             Action
200                  Success             Process response
301, 302             Redirect            Follow redirect
403                  Forbidden/blocked   Retry with different IP, slow down
429                  Rate limited        Back off exponentially
500, 502, 503        Server error        Retry after short delay
Connection timeout   Network issue       Retry once
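The table above translates directly into a dispatch function (the action names are illustrative labels, not part of any library):

```python
def classify_response(status_code: int) -> str:
    """Map a status code to a retry action, following the table above."""
    if status_code == 200:
        return "process"
    if status_code in (301, 302):
        return "follow_redirect"
    if status_code == 403:
        return "rotate_ip_and_slow_down"
    if status_code == 429:
        return "backoff_exponentially"
    if status_code >= 500:
        return "retry_after_delay"
    return "give_up"  # anything else (e.g. 404) is non-retryable
```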

Implement Exponential Backoff

When you get rate-limited or blocked, don’t immediately retry. Back off:

import random
import time

def scrape_with_backoff(url: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        result = proxy_request(url)

        if result["statusCode"] == 200:
            return result

        if result["statusCode"] in (403, 429):
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Blocked/rate-limited. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
            continue

        if result["statusCode"] >= 500:
            time.sleep(1)
            continue

        # Non-retryable error
        return result

    return result  # Return last result after all retries

Track Success Rates Per Domain

Monitor which domains your agent successfully scrapes and which ones cause problems. This data helps you tune rate limits and identify sites that need different approaches:

from collections import defaultdict

stats = defaultdict(lambda: {"success": 0, "failed": 0})

def track_result(domain: str, status_code: int):
    if 200 <= status_code < 400:
        stats[domain]["success"] += 1
    else:
        stats[domain]["failed"] += 1

def get_success_rate(domain: str) -> float:
    s = stats[domain]
    total = s["success"] + s["failed"]
    return s["success"] / total if total > 0 else 0.0
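Success-rate data is most useful when it feeds back into the rate limiter. A minimal sketch of that feedback loop (the thresholds and multipliers are illustrative starting points, not recommendations from any benchmark):

```python
def adjust_delay(current_delay: float, success_rate: float,
                 min_delay: float = 0.5, max_delay: float = 30.0) -> float:
    """Slow down when a domain's success rate drops; cautiously speed up
    when it stays healthy."""
    if success_rate < 0.8:
        return min(current_delay * 2, max_delay)   # too many failures: back off
    if success_rate > 0.95:
        return max(current_delay * 0.9, min_delay) # healthy: trim 10%
    return current_delay
```

Recompute the delay periodically (say, every 50 requests per domain) rather than after every response, so one unlucky failure doesn't whipsaw your rate.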

Headers and Fingerprinting

A residential IP is the foundation, but your request headers matter too.

Set a Realistic User-Agent

Don’t use the default python-requests/2.28.0 or curl/7.88.0. Use a current browser user-agent string:

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

Rotate User-Agents

Don’t use the same user-agent string for every request. Maintain a small pool of realistic user-agents and rotate them:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

def get_random_ua() -> str:
    return random.choice(USER_AGENTS)

Cost Optimization

Residential proxy requests cost money. Here’s how to minimize waste:

Cache Responses

If your agent might request the same URL multiple times within a session, cache the response locally. There’s no reason to pay for the same data twice.
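A simple in-memory cache with a time-to-live is enough for most single-session agents (a sketch; swap in a persistent store if your agent runs long):

```python
import time

class ResponseCache:
    """Tiny in-memory cache keyed by URL, with a TTL in seconds."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store = {}  # url -> (fetched_at, response)

    def get(self, url: str):
        entry = self._store.get(url)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # missing or stale

    def put(self, url: str, response):
        self._store[url] = (time.time(), response)
```

Check the cache before every proxy call; on a miss, fetch through the proxy and store the response so the next lookup is free.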

Filter Before You Fetch

If your agent generates a list of URLs to visit, filter out obviously irrelevant ones before making proxy requests. Validate URLs, deduplicate, and prioritize.
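The validate-and-deduplicate step can be a single pass over the URL list before any proxy money is spent:

```python
from urllib.parse import urlparse

def filter_urls(urls, allowed_schemes=("http", "https")):
    """Drop malformed and duplicate URLs, preserving input order."""
    seen = set()
    keep = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in allowed_schemes or not parsed.netloc:
            continue  # malformed or non-HTTP: not worth a paid request
        if url in seen:
            continue  # exact duplicate already queued
        seen.add(url)
        keep.append(url)
    return keep
```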

Use Conditional Requests

If you’re monitoring a page for changes, use If-Modified-Since or If-None-Match headers. The server returns a 304 Not Modified if nothing changed, which is cheaper to process.
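The bookkeeping for this is small: remember the ETag and Last-Modified values from each 200 response, and send them back on the next poll. A sketch (note that real servers may return these header names in different casing, so normalize if your HTTP client doesn't do it for you):

```python
validators = {}  # url -> stored ETag / Last-Modified values

def conditional_headers(url: str) -> dict:
    """Build If-None-Match / If-Modified-Since headers from stored validators."""
    stored = validators.get(url, {})
    headers = {}
    if "ETag" in stored:
        headers["If-None-Match"] = stored["ETag"]
    if "Last-Modified" in stored:
        headers["If-Modified-Since"] = stored["Last-Modified"]
    return headers

def remember_validators(url: str, response_headers: dict):
    """After a 200 response, store its validators for the next poll."""
    entry = {k: response_headers[k] for k in ("ETag", "Last-Modified")
             if k in response_headers}
    if entry:
        validators[url] = entry
```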

Putting It All Together

A well-built scraping layer for an AI agent combines all of these practices:

  1. Check robots.txt before scraping a new domain.
  2. Apply per-domain rate limits with jitter.
  3. Route through residential IPs with appropriate geo-targeting.
  4. Use realistic, rotating headers.
  5. Handle errors with categorized retry logic and exponential backoff.
  6. Cache responses and deduplicate requests.
  7. Monitor success rates and adjust strategy per domain.

With a service like RentaTube, the proxy infrastructure side is handled for you — pay-per-request pricing at $0.001 USDC, geo-targeting by country code, and a REST API that fits cleanly into any HTTP-based workflow. The practices in this article ensure you use that infrastructure as effectively as possible.

The goal is not just to avoid blocks. It’s to build an agent that accesses the web reliably, efficiently, and responsibly — every time it runs.
