
Build Your First AI Web Scraper in 2026: A Beginner's Guide (With Proxy Setup)

Step-by-step tutorial to build an AI-powered web scraper using Python, LLMs, and residential proxies. Includes code examples, proxy integration, and anti-bot handling.

RentaTube

Traditional web scraping relies on brittle CSS selectors and XPath expressions that break every time a website changes its HTML structure. AI-powered scraping takes a different approach: instead of writing rules to extract specific elements, you feed raw page content to a large language model (LLM) and ask it to extract the data you need. The LLM understands the page semantically, making your scraper dramatically more resilient to layout changes.

This guide walks you through building your first AI web scraper from scratch. You will set up the Python environment, fetch web pages through residential proxies, parse the content with an LLM, and handle the common obstacles that trip up beginners. By the end, you will have a working scraper that can extract structured data from almost any website.

What You Will Need

  • Python 3.10 or later installed on your machine.
  • An OpenAI or Anthropic API key for LLM-based parsing.
  • A residential proxy service for reliable web access (we will use RentaTube in the examples).
  • Basic Python knowledge — you should be comfortable with functions, dictionaries, and installing packages.

Step 1: Set Up Your Environment

Create a project directory and install the dependencies:

mkdir ai-scraper && cd ai-scraper
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install requests beautifulsoup4 openai anthropic

Create a .env file for your API keys (never commit this to version control):

OPENAI_API_KEY=sk-your-openai-key-here
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
RENTATUBE_API_KEY=rt_live_your-rentatube-api-key-here

And a simple config loader:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
RENTATUBE_API_KEY = os.getenv("RENTATUBE_API_KEY")
RENTATUBE_API_URL = "https://api.rentatube.dev/api/v1/proxy"

You will also need python-dotenv:

pip install python-dotenv

Step 2: Fetch Web Pages Through a Proxy

The first challenge is reliably getting the HTML content of web pages. Many sites block requests from datacenter IPs, return CAPTCHAs, or serve different content to detected bots. Residential proxies solve this by routing your request through a real ISP-assigned IP address, making it indistinguishable from normal browser traffic.

Here is a function that fetches a page through a residential proxy:

# fetcher.py
import requests
import random
import time
from config import RENTATUBE_API_KEY, RENTATUBE_API_URL

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]

def fetch_page(url: str, country: str = "US", max_retries: int = 3) -> str | None:
    """
    Fetch a web page through a residential proxy.
    Returns the HTML content or None if all retries fail.
    """
    for attempt in range(max_retries):
        try:
            response = requests.post(
                RENTATUBE_API_URL,
                headers={
                    "X-API-Key": RENTATUBE_API_KEY,
                    "Content-Type": "application/json",
                },
                json={
                    "request": {
                        "url": url,
                        "method": "GET",
                        "headers": {
                            "User-Agent": random.choice(USER_AGENTS),
                            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                            "Accept-Language": "en-US,en;q=0.9",
                        },
                    },
                    "country": country,
                },
                timeout=30,
            )

            data = response.json()
            status = data.get("statusCode", 0)

            if status == 200:
                return data.get("body", "")

            if status in (403, 429, 503):
                wait = (2 ** attempt) + random.uniform(0, 2)
                print(f"  Attempt {attempt + 1}: Got {status}, retrying in {wait:.1f}s...")
                time.sleep(wait)
                continue

            # Other statuses (404, 500, etc.) are unlikely to improve on retry
            print(f"  Got non-retryable status {status}, giving up")
            return None

        except requests.exceptions.RequestException as e:
            print(f"  Attempt {attempt + 1}: Request error: {e}")
            time.sleep(2)

    return None

Why Not Just Use requests Directly?

You might wonder why you cannot just call requests.get(url) without a proxy. For many websites, you can — and you should start there for testing. But in production, direct requests fail for several reasons:

  • IP blocks: Your server’s datacenter IP gets flagged after a handful of requests.
  • Geo-restrictions: Some content is only available from specific countries.
  • Rate limiting: Without IP rotation, repeated requests to the same site quickly trigger blocks.

A residential proxy rotates your IP with every request, making each request appear to come from a different household. This is the foundation of reliable scraping at any meaningful scale.

Step 3: Clean the HTML

Raw HTML is full of noise — navigation menus, scripts, stylesheets, footers, ads. Sending all of this to an LLM wastes tokens and money. Clean the HTML down to the meaningful content first:

# cleaner.py
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """
    Strip HTML down to meaningful text content.
    Removes scripts, styles, nav elements, and other noise.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements that never contain useful content
    for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside", "noscript", "iframe"]):
        tag.decompose()

    # Remove hidden elements
    for tag in soup.find_all(attrs={"style": lambda s: s and "display:none" in s.replace(" ", "")}):
        tag.decompose()

    # Get the main content area if it exists
    main = soup.find("main") or soup.find("article") or soup.find("div", {"role": "main"})
    if main:
        text = main.get_text(separator="\n", strip=True)
    else:
        text = soup.get_text(separator="\n", strip=True)

    # Remove excessive blank lines
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

This function typically reduces the input size by 70-90%, which directly reduces your LLM API costs.
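To put that reduction in dollar terms, you can use the common rule of thumb of roughly four characters per token. The helper below is a rough estimator; the default price is a placeholder for a small model, so substitute your provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic."""
    return len(text) // 4

def estimate_input_cost(text: str, price_per_million_tokens: float = 0.15) -> float:
    """
    Estimate input cost in dollars. The default price is a placeholder --
    substitute your provider's actual per-million-token rate.
    """
    return estimate_tokens(text) / 1_000_000 * price_per_million_tokens

# Typical before/after clean_html sizes for a product page (illustrative numbers)
raw_len, cleaned_len = 120_000, 15_000
saved = estimate_tokens("x" * raw_len) - estimate_tokens("x" * cleaned_len)
print(f"~{saved} input tokens saved per page")
```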

Step 4: Extract Data with an LLM

This is where the AI in “AI web scraper” comes in. Instead of writing CSS selectors or regex patterns to extract specific data, you send the cleaned text to an LLM with a structured prompt.

Using OpenAI

# extractor_openai.py
import json
from openai import OpenAI
from config import OPENAI_API_KEY

client = OpenAI(api_key=OPENAI_API_KEY)

def extract_with_openai(text: str, schema: dict, instructions: str = "") -> dict | None:
    """
    Extract structured data from text using OpenAI.

    Args:
        text: The cleaned page text.
        schema: A dict describing the fields to extract.
        instructions: Optional extra instructions for the LLM.

    Returns:
        A dict matching the schema, or None on failure.
    """
    schema_description = json.dumps(schema, indent=2)

    prompt = f"""Extract the following data from the text below. Return ONLY a valid JSON object matching this schema:

{schema_description}

{f"Additional instructions: {instructions}" if instructions else ""}

If a field cannot be found, use null.

TEXT:
{text[:8000]}"""  # Limit text to control token usage

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a data extraction assistant. Return only valid JSON, no markdown formatting."},
                {"role": "user", "content": prompt},
            ],
            temperature=0,
            max_tokens=1000,
        )

        content = response.choices[0].message.content.strip()
        # Remove markdown code fences if present
        if content.startswith("```"):
            content = content.split("\n", 1)[1].rsplit("```", 1)[0]

        return json.loads(content)

    except Exception as e:  # Covers JSON decode errors and API errors alike
        print(f"Extraction failed: {e}")
        return None

Using Anthropic Claude

# extractor_claude.py
import json
import anthropic
from config import ANTHROPIC_API_KEY

client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

def extract_with_claude(text: str, schema: dict, instructions: str = "") -> dict | None:
    """
    Extract structured data from text using Claude.

    Args:
        text: The cleaned page text.
        schema: A dict describing the fields to extract.
        instructions: Optional extra instructions for the LLM.

    Returns:
        A dict matching the schema, or None on failure.
    """
    schema_description = json.dumps(schema, indent=2)

    prompt = f"""Extract the following data from the text below. Return ONLY a valid JSON object matching this schema:

{schema_description}

{f"Additional instructions: {instructions}" if instructions else ""}

If a field cannot be found, use null.

TEXT:
{text[:8000]}"""

    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt},
            ],
            system="You are a data extraction assistant. Return only valid JSON, no markdown formatting or explanation.",
            temperature=0,
        )

        content = response.content[0].text.strip()
        if content.startswith("```"):
            content = content.split("\n", 1)[1].rsplit("```", 1)[0]

        return json.loads(content)

    except Exception as e:  # Covers JSON decode errors and API errors alike
        print(f"Extraction failed: {e}")
        return None

Defining Extraction Schemas

The schema tells the LLM what data you want. Here are examples for common use cases:

# schemas.py

PRODUCT_SCHEMA = {
    "name": "string - the product name",
    "price": "number - the current price in dollars",
    "original_price": "number or null - the original price before discount",
    "currency": "string - the currency code (USD, EUR, etc.)",
    "availability": "string - in stock, out of stock, or limited",
    "rating": "number or null - the average rating out of 5",
    "review_count": "number or null - the total number of reviews",
    "description": "string - a brief product description",
}

JOB_LISTING_SCHEMA = {
    "title": "string - the job title",
    "company": "string - the company name",
    "location": "string - the job location",
    "salary_range": "string or null - the salary range if listed",
    "employment_type": "string - full-time, part-time, contract, etc.",
    "experience_level": "string - entry, mid, senior, etc.",
    "posted_date": "string or null - when the job was posted",
    "key_requirements": "list of strings - top 5 requirements",
}

ARTICLE_SCHEMA = {
    "title": "string - the article title",
    "author": "string or null - the author name",
    "published_date": "string or null - the publication date",
    "summary": "string - a 2-3 sentence summary",
    "main_topics": "list of strings - the main topics covered",
}
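Even with a clear schema, LLMs occasionally return extra keys or omit fields. A small post-processing step, sketched below as a hypothetical helper (not required by the extractors above), normalizes the output against the schema before you store it:

```python
def conform_to_schema(data: dict, schema: dict) -> dict:
    """
    Force an extraction result into the schema's shape:
    missing fields become None, unexpected fields are dropped.
    """
    return {key: data.get(key) for key in schema}

raw = {"name": "Widget", "price": 9.99, "junk_field": "ignore me"}
clean = conform_to_schema(raw, {"name": "...", "price": "...", "rating": "..."})
# clean == {"name": "Widget", "price": 9.99, "rating": None}
```

Running this right after json.loads gives every downstream consumer a predictable dict shape regardless of what the model returned.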

Step 5: Put It All Together

Now combine the fetcher, cleaner, and extractor into a complete scraping pipeline:

# scraper.py
from fetcher import fetch_page
from cleaner import clean_html
from extractor_openai import extract_with_openai
# Or: from extractor_claude import extract_with_claude
from schemas import PRODUCT_SCHEMA

def scrape_product(url: str, country: str = "US") -> dict | None:
    """
    Scrape a product page and extract structured data.

    Args:
        url: The product page URL.
        country: Country code for geo-targeting the proxy.

    Returns:
        A dict with extracted product data, or None on failure.
    """
    print(f"Fetching: {url}")
    html = fetch_page(url, country=country)

    if not html:
        print("  Failed to fetch page")
        return None

    print(f"  Fetched {len(html)} bytes, cleaning...")
    cleaned = clean_html(html)
    print(f"  Cleaned to {len(cleaned)} chars")

    print("  Extracting data with LLM...")
    data = extract_with_openai(
        text=cleaned,
        schema=PRODUCT_SCHEMA,
        instructions="Extract the primary product information. If multiple products appear, extract only the main/featured one.",
    )

    if data:
        data["source_url"] = url
        print(f"  Extracted: {data.get('name', 'Unknown')} - {data.get('price', 'N/A')}")

    return data


if __name__ == "__main__":
    # Example: scrape a product page
    result = scrape_product("https://example.com/product/12345")
    if result:
        import json
        print("\nExtracted Data:")
        print(json.dumps(result, indent=2))

Step 6: Scraping Multiple Pages

A single-page scraper is useful for testing, but real applications need to process multiple URLs. Here is how to build a batch scraper with basic rate limiting:

# batch_scraper.py
import json
import time
import random
from scraper import scrape_product

def scrape_product_list(urls: list[dict], output_file: str = "results.json") -> list[dict]:
    """
    Scrape a list of product URLs with rate limiting.

    Each item: {"url": "https://...", "country": "US"}
    """
    results = []

    for i, item in enumerate(urls):
        print(f"\n[{i + 1}/{len(urls)}]")
        data = scrape_product(item["url"], country=item.get("country", "US"))

        if data:
            results.append(data)

        # Rate limit: wait between requests to be respectful
        if i < len(urls) - 1:
            delay = random.uniform(2.0, 5.0)
            print(f"  Waiting {delay:.1f}s before next request...")
            time.sleep(delay)

    # Save results
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)

    print(f"\nDone. {len(results)}/{len(urls)} products extracted.")
    print(f"Results saved to {output_file}")
    return results


if __name__ == "__main__":
    urls = [
        {"url": "https://example.com/product/1", "country": "US"},
        {"url": "https://example.com/product/2", "country": "US"},
        {"url": "https://example.de/product/3", "country": "DE"},
    ]
    scrape_product_list(urls)

When to Use Residential vs Datacenter Proxies

Not every scraping target requires a residential proxy. Here is a quick decision framework:

Use residential proxies when:

  • The target site uses anti-bot protection (Cloudflare, Akamai, DataDome, PerimeterX).
  • You need to access geo-restricted content.
  • The site serves different content to detected bots.
  • You are scraping at volume and need high success rates.
  • The target is a major ecommerce, social media, or news site.

Datacenter proxies may suffice when:

  • The target is a public API with no bot detection.
  • You have explicit permission to scrape.
  • The target is your own infrastructure.
  • Speed matters more than stealth (latency is typically 1-20ms vs 50-200ms for residential).

For most AI web scraping use cases — especially when your scraper visits diverse, unknown sites at runtime — residential proxies are the safer default. The cost difference per successful request is small when you factor in the failure rate of datacenter proxies on protected sites.

Handling Common Anti-Bot Measures

Even with residential proxies, you will encounter obstacles. Here is how to handle the most common ones.

CAPTCHAs

If you are hitting CAPTCHAs regularly, your scraping pattern is too aggressive. Slow down, add more randomization to your timing, and make sure your request headers are realistic. CAPTCHAs are a warning before a full block — heed them.

If occasional CAPTCHAs are unavoidable, you have two options:

  1. Skip and retry: Mark the URL for retry later. A different IP and time window often bypasses the challenge.
  2. CAPTCHA-solving services: Third-party services can solve CAPTCHAs, but they add cost ($1-3 per 1,000 solves) and latency (10-30 seconds per solve).

Option 1 is almost always better. If you are solving CAPTCHAs at scale, something in your approach needs fixing.
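A minimal way to implement skip-and-retry is a queue of pending URLs with a per-URL attempt cap: failed URLs go to the back, so each retry happens later (and, with a rotating proxy, from a different IP). This is a sketch that takes the scrape function as a parameter, so it works with any of the scrapers above.

```python
from collections import deque

def scrape_with_retries(urls: list[str], scrape_fn, max_attempts: int = 3) -> dict:
    """
    Round-robin retry queue. `scrape_fn` takes a URL and returns
    extracted data or None on failure.
    """
    queue = deque((url, 1) for url in urls)
    results, failed = {}, []

    while queue:
        url, attempt = queue.popleft()
        data = scrape_fn(url)
        if data is not None:
            results[url] = data
        elif attempt < max_attempts:
            queue.append((url, attempt + 1))  # Retry later, not immediately
        else:
            failed.append(url)

    return {"results": results, "failed": failed}
```

In a real run you would add a delay between attempts; the key idea is that retries are deferred rather than hammered back-to-back.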

JavaScript-Rendered Content

Some sites load content dynamically with JavaScript. Your fetcher gets HTML, but the actual data is loaded by client-side scripts. Signs of this:

  • The extracted text is mostly navigation and boilerplate, with the main content missing.
  • You see placeholder elements like “Loading…” or empty product cards.
  • The LLM returns null for most fields despite the page being real.

Solutions:

  • Check for API endpoints: Many JS-rendered sites load data from internal APIs. Open browser DevTools, watch the Network tab, and look for JSON API calls. Scraping the API directly is faster, cheaper, and more reliable.
  • Use a headless browser: Tools like Playwright or Puppeteer render the full page including JavaScript. This is slower and more expensive but works for stubborn sites.
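You can often detect JS-rendered pages automatically before wasting an LLM call. The heuristic below flags pages whose visible text is suspiciously thin relative to the markup; the placeholder strings and thresholds are guesses you should tune for your targets.

```python
def probably_js_rendered(html: str, cleaned_text: str) -> bool:
    """
    Heuristic check for client-side rendering: lots of markup but
    little text, or telltale loading placeholders.
    """
    placeholders = ("loading...", "loading…", "please enable javascript")
    lowered = cleaned_text.lower()
    if any(p in lowered for p in placeholders):
        return True
    # Under ~2% visible text on a large page suggests an empty shell
    return len(html) > 20_000 and len(cleaned_text) / len(html) < 0.02
```

Call it on the output of clean_html; when it fires, route the URL to your headless-browser path (or to API discovery) instead of the LLM extractor.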

Rate Limit (429) Responses

A 429 response means you are sending too many requests. Your retry logic should back off exponentially:

import random

def calculate_backoff(attempt: int, base: float = 2.0) -> float:
    """Calculate exponential backoff with jitter."""
    return (base ** attempt) + random.uniform(0, base)

# Attempt 0: 1-3s, Attempt 1: 2-4s, Attempt 2: 4-6s, Attempt 3: 8-10s

Also consider reducing your overall scraping rate for that domain going forward. A 429 is the site telling you its threshold — respect it.
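One way to respect that threshold programmatically is a per-domain delay that doubles on every 429 and slowly decays back on success. The class below is a sketch; the multipliers and bounds are arbitrary starting points to tune.

```python
from urllib.parse import urlparse

class DomainThrottle:
    """Track a polite inter-request delay per domain."""

    def __init__(self, base_delay: float = 2.0, max_delay: float = 120.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delays: dict[str, float] = {}

    def _domain(self, url: str) -> str:
        return urlparse(url).netloc

    def delay_for(self, url: str) -> float:
        return self.delays.get(self._domain(url), self.base_delay)

    def record_429(self, url: str) -> None:
        """Double the delay for this domain, up to max_delay."""
        self.delays[self._domain(url)] = min(self.delay_for(url) * 2, self.max_delay)

    def record_success(self, url: str) -> None:
        """Decay the delay back toward base_delay after good responses."""
        self.delays[self._domain(url)] = max(self.delay_for(url) * 0.9, self.base_delay)
```

Sleep for delay_for(url) before each request, and call record_429 or record_success based on the response; the scraper then self-tunes to each site's tolerance.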

Redirect Loops and Soft Blocks

Some sites do not return an error code. Instead, they redirect scrapers to a different page (a “soft block”). Signs:

  • You get a 200 response, but the content is a generic landing page instead of the product page.
  • The URL in the response does not match the URL you requested.
  • The LLM consistently extracts the same generic data from different product URLs.

Detect this by checking for expected content markers after fetching:

def is_valid_product_page(html: str) -> bool:
    """Basic check that the response looks like a real product page."""
    indicators = ["add to cart", "price", "product", "buy now", "in stock"]
    text = html.lower()
    matches = sum(1 for indicator in indicators if indicator in text)
    return matches >= 2  # At least 2 indicators should be present

Cost Optimization Tips

AI-powered scraping has two cost components: proxy requests and LLM API calls. Here is how to manage both.

Minimize LLM Token Usage

The clean_html function already helps, but you can go further:

  • Truncate intelligently: Product data is almost always in the first half of the page. Truncating to 4,000-8,000 characters covers most cases.
  • Use smaller models: For straightforward extraction, gpt-4o-mini or claude-sonnet work just as well as larger models at a fraction of the cost.
  • Cache LLM results: If you re-scrape a page and the content has not changed (check with a hash), reuse the previous extraction.

Minimize Proxy Requests

  • Check robots.txt once per domain and cache the result. Do not re-fetch it for every URL.
  • Deduplicate URLs before scraping. Normalize URLs by removing tracking parameters.
  • Use conditional requests: Send If-Modified-Since headers when re-scraping pages you have fetched before. A 304 Not Modified response saves you an LLM call.
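Deduplication works much better after normalizing URLs. The tracking-parameter list below is a partial one based on common conventions (utm_* plus a few ad-click IDs); extend it for your own targets.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Partial list of common tracking parameters -- extend as needed
TRACKING_PARAMS = {"fbclid", "gclid", "msclkid", "ref"}

def normalize_url(url: str) -> str:
    """Strip tracking parameters and fragments, sort the remaining params."""
    parts = urlparse(url)
    params = [
        (k, v) for k, v in parse_qsl(parts.query)
        if k not in TRACKING_PARAMS and not k.startswith("utm_")
    ]
    return urlunparse(parts._replace(query=urlencode(sorted(params)), fragment=""))

def dedupe_urls(urls: list[str]) -> list[str]:
    """Deduplicate normalized URLs while preserving order."""
    seen, out = set(), []
    for url in urls:
        norm = normalize_url(url)
        if norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```

Run dedupe_urls over your input list before the batch scraper; every duplicate eliminated saves both a proxy request and an LLM call.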

Pay-Per-Request Pricing

With a pay-per-request proxy like RentaTube, cost optimization is straightforward: every request you eliminate saves exactly $0.001. There is no sunk cost from an unused subscription, no pressure to “use up” a monthly quota, and no surprise overages. You pay for what you use.

For a beginner scraping project making 1,000-5,000 requests per month, that is $1-5 in proxy costs. Even adding LLM API costs (~$0.50-2.00 for the same volume with a small model), the total cost of running an AI-powered scraper is remarkably low.

Complete Project Structure

Here is the final project layout:

ai-scraper/
  .env                    # API keys (never commit this)
  config.py               # Configuration loader
  fetcher.py              # Proxy-based page fetching
  cleaner.py              # HTML cleaning and text extraction
  extractor_openai.py     # OpenAI-based data extraction
  extractor_claude.py     # Claude-based data extraction
  schemas.py              # Data extraction schemas
  scraper.py              # Single-page scraping pipeline
  batch_scraper.py        # Multi-page batch scraping
  requirements.txt        # Dependencies

Your requirements.txt:

requests
beautifulsoup4
openai
anthropic
python-dotenv

Where to Go from Here

You now have a working AI web scraper with proxy support. Here are natural next steps as you build on this foundation:

  • Add a database: Store results in SQLite or PostgreSQL instead of JSON files. Track historical data over time.
  • Schedule scraping runs: Use cron, Celery, or a managed scheduler to run your scraper on a schedule.
  • Build a monitoring dashboard: Track success rates, extraction quality, and costs per domain.
  • Add more schemas: Adapt the scraper for different data types — job listings, real estate, news articles, restaurant menus.
  • Implement parallel scraping: Use asyncio or concurrent.futures to scrape multiple pages concurrently (with appropriate rate limiting per domain).

The combination of LLM-based parsing and residential proxies makes web scraping more accessible and more reliable than it has ever been. The LLM handles the brittle parsing problem. The proxy handles the access problem. Your job is connecting them with clean, well-structured code.

Ready to start scraping? Sign up for a RentaTube API key at rentatube.dev — it takes 30 seconds with just an Ethereum wallet, and your first $1 of USDC gets you 1,000 proxy requests. No subscription, no minimum, no commitment.
