Build Your First AI Web Scraper in 2026: A Beginner's Guide (With Proxy Setup)
Step-by-step tutorial to build an AI-powered web scraper using Python, LLMs, and residential proxies. Includes code examples, proxy integration, and anti-bot handling.
Traditional web scraping relies on brittle CSS selectors and XPath expressions that break every time a website changes its HTML structure. AI-powered scraping takes a different approach: instead of writing rules to extract specific elements, you feed raw page content to a large language model (LLM) and ask it to extract the data you need. The LLM understands the page semantically, making your scraper dramatically more resilient to layout changes.
This guide walks you through building your first AI web scraper from scratch. You will set up the Python environment, fetch web pages through residential proxies, parse the content with an LLM, and handle the common obstacles that trip up beginners. By the end, you will have a working scraper that can extract structured data from almost any website.
What You Will Need
- Python 3.10 or later installed on your machine.
- An OpenAI or Anthropic API key for LLM-based parsing.
- A residential proxy service for reliable web access (we will use RentaTube in the examples).
- Basic Python knowledge — you should be comfortable with functions, dictionaries, and installing packages.
Step 1: Set Up Your Environment
Create a project directory and install the dependencies:
mkdir ai-scraper && cd ai-scraper
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install requests beautifulsoup4 openai anthropic
Create a .env file for your API keys (never commit this to version control):
OPENAI_API_KEY=sk-your-openai-key-here
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
RENTATUBE_API_KEY=rt_live_your-rentatube-api-key-here
And a simple config loader:
# config.py
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
RENTATUBE_API_KEY = os.getenv("RENTATUBE_API_KEY")
RENTATUBE_API_URL = "https://api.rentatube.dev/api/v1/proxy"
You will also need python-dotenv:
pip install python-dotenv
Step 2: Fetch Web Pages Through a Proxy
The first challenge is reliably getting the HTML content of web pages. Many sites block requests from datacenter IPs, return CAPTCHAs, or serve different content to detected bots. Residential proxies solve this by routing your request through a real ISP-assigned IP address, making it indistinguishable from normal browser traffic.
Here is a function that fetches a page through a residential proxy:
# fetcher.py
import requests
import random
import time
from config import RENTATUBE_API_KEY, RENTATUBE_API_URL
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]
def fetch_page(url: str, country: str = "US", max_retries: int = 3) -> str | None:
"""
Fetch a web page through a residential proxy.
Returns the HTML content or None if all retries fail.
"""
for attempt in range(max_retries):
try:
response = requests.post(
RENTATUBE_API_URL,
headers={
"X-API-Key": RENTATUBE_API_KEY,
"Content-Type": "application/json",
},
json={
"request": {
"url": url,
"method": "GET",
"headers": {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
},
},
"country": country,
},
timeout=30,
)
data = response.json()
status = data.get("statusCode", 0)
if status == 200:
return data.get("body", "")
if status in (403, 429, 503):
wait = (2 ** attempt) + random.uniform(0, 2)
print(f" Attempt {attempt + 1}: Got {status}, retrying in {wait:.1f}s...")
time.sleep(wait)
continue
except requests.exceptions.RequestException as e:
print(f" Attempt {attempt + 1}: Request error: {e}")
time.sleep(2)
return None
Why Not Just Use requests Directly?
You might wonder why you cannot just call requests.get(url) without a proxy. For many websites, you can — and you should start there for testing. But in production, direct requests fail for several reasons:
- IP blocks: Your server’s datacenter IP gets flagged after a handful of requests.
- Geo-restrictions: Some content is only available from specific countries.
- Rate limiting: Without IP rotation, hitting the same site repeatedly quickly triggers blocks.
A residential proxy rotates your IP with every request, making each request appear to come from a different household. This is the foundation of reliable scraping at any meaningful scale.
Step 3: Clean the HTML
Raw HTML is full of noise — navigation menus, scripts, stylesheets, footers, ads. Sending all of this to an LLM wastes tokens and money. Clean the HTML down to the meaningful content first:
# cleaner.py
from bs4 import BeautifulSoup
def clean_html(html: str) -> str:
"""
Strip HTML down to meaningful text content.
Removes scripts, styles, nav elements, and other noise.
"""
soup = BeautifulSoup(html, "html.parser")
# Remove elements that never contain useful content
for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside", "noscript", "iframe"]):
tag.decompose()
# Remove hidden elements
for tag in soup.find_all(attrs={"style": lambda s: s and "display:none" in s.replace(" ", "")}):
tag.decompose()
# Get the main content area if it exists
main = soup.find("main") or soup.find("article") or soup.find("div", {"role": "main"})
if main:
text = main.get_text(separator="\n", strip=True)
else:
text = soup.get_text(separator="\n", strip=True)
# Remove excessive blank lines
lines = [line.strip() for line in text.splitlines() if line.strip()]
return "\n".join(lines)
This function typically reduces the input size by 70-90%, which directly reduces your LLM API costs.
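Since LLM pricing is per token, a quick budget check on the cleaned text is useful before sending it anywhere. A rough rule of thumb for English text is about four characters per token — an approximation, not a real tokenizer (the price constant below is a placeholder; check your provider's current rates):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~4 characters per token.
    An approximation, not a tokenizer -- good enough for budgeting."""
    return len(text) // 4

def estimate_extraction_cost(text: str, price_per_million: float = 0.15) -> float:
    """Approximate input cost in dollars for one extraction call.
    The default price is a placeholder -- check current provider rates."""
    return estimate_tokens(text) / 1_000_000 * price_per_million
```

For the 8,000-character truncation used in the extractors below, that works out to roughly 2,000 input tokens per call.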
Step 4: Extract Data with an LLM
This is where the AI in “AI web scraper” comes in. Instead of writing CSS selectors or regex patterns to extract specific data, you send the cleaned text to an LLM with a structured prompt.
Using OpenAI
# extractor_openai.py
import json
from openai import OpenAI
from config import OPENAI_API_KEY
client = OpenAI(api_key=OPENAI_API_KEY)
def extract_with_openai(text: str, schema: dict, instructions: str = "") -> dict | None:
"""
Extract structured data from text using OpenAI.
Args:
text: The cleaned page text.
schema: A dict describing the fields to extract.
instructions: Optional extra instructions for the LLM.
Returns:
A dict matching the schema, or None on failure.
"""
schema_description = json.dumps(schema, indent=2)
prompt = f"""Extract the following data from the text below. Return ONLY a valid JSON object matching this schema:
{schema_description}
{f"Additional instructions: {instructions}" if instructions else ""}
If a field cannot be found, use null.
TEXT:
{text[:8000]}""" # Limit text to control token usage
try:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a data extraction assistant. Return only valid JSON, no markdown formatting."},
{"role": "user", "content": prompt},
],
temperature=0,
max_tokens=1000,
)
content = response.choices[0].message.content.strip()
# Remove markdown code fences if present
if content.startswith("```"):
content = content.split("\n", 1)[1].rsplit("```", 1)[0]
return json.loads(content)
except Exception as e:  # covers json.JSONDecodeError, API errors, etc.
print(f"Extraction failed: {e}")
return None
Using Anthropic Claude
# extractor_claude.py
import json
import anthropic
from config import ANTHROPIC_API_KEY
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
def extract_with_claude(text: str, schema: dict, instructions: str = "") -> dict | None:
"""
Extract structured data from text using Claude.
Args:
text: The cleaned page text.
schema: A dict describing the fields to extract.
instructions: Optional extra instructions for the LLM.
Returns:
A dict matching the schema, or None on failure.
"""
schema_description = json.dumps(schema, indent=2)
prompt = f"""Extract the following data from the text below. Return ONLY a valid JSON object matching this schema:
{schema_description}
{f"Additional instructions: {instructions}" if instructions else ""}
If a field cannot be found, use null.
TEXT:
{text[:8000]}"""
try:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
messages=[
{"role": "user", "content": prompt},
],
system="You are a data extraction assistant. Return only valid JSON, no markdown formatting or explanation.",
temperature=0,
)
content = response.content[0].text.strip()
if content.startswith("```"):
content = content.split("\n", 1)[1].rsplit("```", 1)[0]
return json.loads(content)
except Exception as e:  # covers json.JSONDecodeError, API errors, etc.
print(f"Extraction failed: {e}")
return None
Defining Extraction Schemas
The schema tells the LLM what data you want. Here are examples for common use cases:
# schemas.py
PRODUCT_SCHEMA = {
"name": "string - the product name",
"price": "number - the current price in dollars",
"original_price": "number or null - the original price before discount",
"currency": "string - the currency code (USD, EUR, etc.)",
"availability": "string - in stock, out of stock, or limited",
"rating": "number or null - the average rating out of 5",
"review_count": "number or null - the total number of reviews",
"description": "string - a brief product description",
}
JOB_LISTING_SCHEMA = {
"title": "string - the job title",
"company": "string - the company name",
"location": "string - the job location",
"salary_range": "string or null - the salary range if listed",
"employment_type": "string - full-time, part-time, contract, etc.",
"experience_level": "string - entry, mid, senior, etc.",
"posted_date": "string or null - when the job was posted",
"key_requirements": "list of strings - top 5 requirements",
}
ARTICLE_SCHEMA = {
"title": "string - the article title",
"author": "string or null - the author name",
"published_date": "string or null - the publication date",
"summary": "string - a 2-3 sentence summary",
"main_topics": "list of strings - the main topics covered",
}
Step 5: Put It All Together
Now combine the fetcher, cleaner, and extractor into a complete scraping pipeline:
# scraper.py
from fetcher import fetch_page
from cleaner import clean_html
from extractor_openai import extract_with_openai
# Or: from extractor_claude import extract_with_claude
from schemas import PRODUCT_SCHEMA
def scrape_product(url: str, country: str = "US") -> dict | None:
"""
Scrape a product page and extract structured data.
Args:
url: The product page URL.
country: Country code for geo-targeting the proxy.
Returns:
A dict with extracted product data, or None on failure.
"""
print(f"Fetching: {url}")
html = fetch_page(url, country=country)
if not html:
print(" Failed to fetch page")
return None
print(f" Fetched {len(html)} bytes, cleaning...")
cleaned = clean_html(html)
print(f" Cleaned to {len(cleaned)} chars")
print(" Extracting data with LLM...")
data = extract_with_openai(
text=cleaned,
schema=PRODUCT_SCHEMA,
instructions="Extract the primary product information. If multiple products appear, extract only the main/featured one.",
)
if data:
data["source_url"] = url
print(f" Extracted: {data.get('name', 'Unknown')} - {data.get('price', 'N/A')}")
return data
if __name__ == "__main__":
# Example: scrape a product page
result = scrape_product("https://example.com/product/12345")
if result:
import json
print("\nExtracted Data:")
print(json.dumps(result, indent=2))
Step 6: Scraping Multiple Pages
A single-page scraper is useful for testing, but real applications need to process multiple URLs. Here is how to build a batch scraper with basic rate limiting:
# batch_scraper.py
import json
import time
import random
from scraper import scrape_product
def scrape_product_list(urls: list[dict], output_file: str = "results.json") -> list[dict]:
"""
Scrape a list of product URLs with rate limiting.
Each item: {"url": "https://...", "country": "US"}
"""
results = []
for i, item in enumerate(urls):
print(f"\n[{i + 1}/{len(urls)}]")
data = scrape_product(item["url"], country=item.get("country", "US"))
if data:
results.append(data)
# Rate limit: wait between requests to be respectful
if i < len(urls) - 1:
delay = random.uniform(2.0, 5.0)
print(f" Waiting {delay:.1f}s before next request...")
time.sleep(delay)
# Save results
with open(output_file, "w") as f:
json.dump(results, f, indent=2)
print(f"\nDone. {len(results)}/{len(urls)} products extracted.")
print(f"Results saved to {output_file}")
return results
if __name__ == "__main__":
urls = [
{"url": "https://example.com/product/1", "country": "US"},
{"url": "https://example.com/product/2", "country": "US"},
{"url": "https://example.de/product/3", "country": "DE"},
]
scrape_product_list(urls)
When to Use Residential vs Datacenter Proxies
Not every scraping target requires a residential proxy. Here is a quick decision framework:
Use residential proxies when:
- The target site uses anti-bot protection (Cloudflare, Akamai, DataDome, PerimeterX).
- You need to access geo-restricted content.
- The site serves different content to detected bots.
- You are scraping at volume and need high success rates.
- The target is a major ecommerce, social media, or news site.
Datacenter proxies may suffice when:
- The target is a public API with no bot detection.
- You have explicit permission to scrape.
- The target is your own infrastructure.
- Speed matters more than stealth (latency is typically 1-20ms vs 50-200ms for residential).
For most AI web scraping use cases — especially when your scraper visits diverse, unknown sites at runtime — residential proxies are the safer default. The cost difference per successful request is small when you factor in the failure rate of datacenter proxies on protected sites.
Handling Common Anti-Bot Measures
Even with residential proxies, you will encounter obstacles. Here is how to handle the most common ones.
CAPTCHAs
If you are hitting CAPTCHAs regularly, your scraping pattern is too aggressive. Slow down, add more randomization to your timing, and make sure your request headers are realistic. CAPTCHAs are a warning before a full block — heed them.
If occasional CAPTCHAs are unavoidable, you have two options:
- Skip and retry: Mark the URL for retry later. A different IP and time window often bypasses the challenge.
- CAPTCHA-solving services: Third-party services can solve CAPTCHAs, but they add cost ($1-3 per 1,000 solves) and latency (10-30 seconds per solve).
Option 1 is almost always better. If you are solving CAPTCHAs at scale, something in your approach needs fixing.
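The skip-and-retry option can be as simple as a second pass over the failed URLs after a cooldown, when a fresh IP and time window often clear the challenge. A sketch (fetch_fn stands in for the fetch_page function from Step 2; the cooldown value is arbitrary):

```python
import time

def scrape_with_retry_pass(urls, fetch_fn, cooldown: float = 300.0):
    """First pass over all URLs; failures go into a retry queue that is
    drained after a cooldown, when a different IP and time window often
    bypass the CAPTCHA."""
    results, retry_queue = {}, []
    for url in urls:
        html = fetch_fn(url)
        if html:
            results[url] = html
        else:
            retry_queue.append(url)
    if retry_queue:
        print(f"Retrying {len(retry_queue)} URLs after {cooldown:.0f}s cooldown...")
        time.sleep(cooldown)
        for url in retry_queue:
            html = fetch_fn(url)
            if html:
                results[url] = html
    return results
```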
JavaScript-Rendered Content
Some sites load content dynamically with JavaScript. Your fetcher gets HTML, but the actual data is loaded by client-side scripts. Signs of this:
- The extracted text is mostly navigation and boilerplate, with the main content missing.
- You see placeholder elements like “Loading…” or empty product cards.
- The LLM returns null for most fields despite the page being real.
Solutions:
- Check for API endpoints: Many JS-rendered sites load data from internal APIs. Open browser DevTools, watch the Network tab, and look for JSON API calls. Scraping the API directly is faster, cheaper, and more reliable.
- Use a headless browser: Tools like Playwright or Puppeteer render the full page including JavaScript. This is slower and more expensive but works for stubborn sites.
Rate Limit (429) Responses
A 429 response means you are sending too many requests. Your retry logic should back off exponentially:
def calculate_backoff(attempt: int, base: float = 2.0) -> float:
"""Calculate exponential backoff with jitter."""
return (base ** attempt) + random.uniform(0, base)
# Attempt 0: 1-3s, Attempt 1: 2-4s, Attempt 2: 4-6s, Attempt 3: 8-10s
Also consider reducing your overall scraping rate for that domain going forward. A 429 is the site telling you its threshold — respect it.
Redirect Loops and Soft Blocks
Some sites do not return an error code. Instead, they redirect scrapers to a different page (a “soft block”). Signs:
- You get a 200 response, but the content is a generic landing page instead of the product page.
- The URL in the response does not match the URL you requested.
- The LLM consistently extracts the same generic data from different product URLs.
Detect this by checking for expected content markers after fetching:
def is_valid_product_page(html: str) -> bool:
"""Basic check that the response looks like a real product page."""
indicators = ["add to cart", "price", "product", "buy now", "in stock"]
text = html.lower()
matches = sum(1 for indicator in indicators if indicator in text)
return matches >= 2 # At least 2 indicators should be present
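You can fold this check into the fetch step so soft-blocked responses count as failures and get retried — each retry goes out on a fresh IP when the fetcher uses a rotating proxy. A sketch (fetch_fn and validate_fn stand in for fetch_page and is_valid_product_page):

```python
def fetch_validated(url, fetch_fn, validate_fn, max_attempts: int = 3):
    """Fetch a page, rejecting soft-blocked responses that pass HTTP-level
    checks but fail content validation. With a rotating proxy, each retry
    comes from a different IP."""
    for attempt in range(max_attempts):
        html = fetch_fn(url)
        if html and validate_fn(html):
            return html
        print(f"  Attempt {attempt + 1}: invalid or soft-blocked response, retrying...")
    return None
```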
Cost Optimization Tips
AI-powered scraping has two cost components: proxy requests and LLM API calls. Here is how to manage both.
Minimize LLM Token Usage
The clean_html function already helps, but you can go further:
- Truncate intelligently: Product data is almost always in the first half of the page. Truncating to 4,000-8,000 characters covers most cases.
- Use smaller models: For straightforward extraction, gpt-4o-mini or claude-sonnet works just as well as larger models at a fraction of the cost.
- Cache LLM results: If you re-scrape a page and the content has not changed (check with a hash), reuse the previous extraction.
Minimize Proxy Requests
- Check robots.txt once per domain and cache the result. Do not re-fetch it for every URL.
- Deduplicate URLs before scraping. Normalize URLs by removing tracking parameters.
- Use conditional requests: Send If-Modified-Since headers when re-scraping pages you have fetched before. A 304 Not Modified response saves you an LLM call.
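Deduplication and normalization need nothing beyond the standard library. A sketch — the tracking-parameter list here is a common subset, not exhaustive:

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid", "ref"}

def normalize_url(url: str) -> str:
    """Drop tracking parameters and fragments so the same page is not
    scraped twice under different URLs."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

def deduplicate(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized URL."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```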
Pay-Per-Request Pricing
With a pay-per-request proxy like RentaTube, cost optimization is straightforward: every request you eliminate saves exactly $0.001. There is no sunk cost from an unused subscription, no pressure to “use up” a monthly quota, and no surprise overages. You pay for what you use.
For a beginner scraping project making 1,000-5,000 requests per month, that is $1-5 in proxy costs. Even adding LLM API costs (~$0.50-2.00 for the same volume with a small model), the total cost of running an AI-powered scraper is remarkably low.
Complete Project Structure
Here is the final project layout:
ai-scraper/
.env # API keys (never commit this)
config.py # Configuration loader
fetcher.py # Proxy-based page fetching
cleaner.py # HTML cleaning and text extraction
extractor_openai.py # OpenAI-based data extraction
extractor_claude.py # Claude-based data extraction
schemas.py # Data extraction schemas
scraper.py # Single-page scraping pipeline
batch_scraper.py # Multi-page batch scraping
requirements.txt # Dependencies
Your requirements.txt:
requests
beautifulsoup4
openai
anthropic
python-dotenv
Where to Go from Here
You now have a working AI web scraper with proxy support. Here are natural next steps as you build on this foundation:
- Add a database: Store results in SQLite or PostgreSQL instead of JSON files. Track historical data over time.
- Schedule scraping runs: Use cron, Celery, or a managed scheduler to run your scraper on a schedule.
- Build a monitoring dashboard: Track success rates, extraction quality, and costs per domain.
- Add more schemas: Adapt the scraper for different data types — job listings, real estate, news articles, restaurant menus.
- Implement parallel scraping: Use asyncio or concurrent.futures to scrape multiple pages concurrently (with appropriate rate limiting per domain).
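As a starting point for the parallel-scraping step, concurrent.futures with a small worker pool is simpler than asyncio. A sketch (scrape_fn stands in for scrape_product from Step 5; the worker count and stagger delay are starting values to tune):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_concurrently(urls: list[str], scrape_fn, max_workers: int = 4) -> list[dict]:
    """Scrape URLs in parallel with a small worker pool. Each worker
    sleeps briefly before fetching so requests are staggered rather
    than fired simultaneously."""
    def worker(url: str):
        time.sleep(random.uniform(0.5, 2.0))  # stagger requests
        return scrape_fn(url)

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            data = future.result()
            if data:
                results.append(data)
    return results
```

For stricter per-domain rate limiting, group URLs by domain first and give each domain its own sequential worker.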
The combination of LLM-based parsing and residential proxies makes web scraping more accessible and more reliable than it has ever been. The LLM handles the brittle parsing problem. The proxy handles the access problem. Your job is connecting them with clean, well-structured code.
Ready to start scraping? Sign up for a RentaTube API key at rentatube.dev — it takes 30 seconds with just an Ethereum wallet, and your first $1 of USDC gets you 1,000 proxy requests. No subscription, no minimum, no commitment.