Scraping React and Next.js Sites — The Complete Guide
AI agents need data from modern web apps, but if you try to scrape a React website using standard HTTP libraries, you will likely receive a nearly empty HTML document. This happens because React and Next.js rely on Client-Side Rendering (CSR), where the content is generated by JavaScript in the browser after the initial page load. To get the actual data, you need a tool that can execute that JavaScript and return the rendered state.
What we're building
In this guide, we will build a data extraction pipeline that can bypass the "empty page" problem common in Single Page Applications (SPAs). We will use the Ilmenite API to render JavaScript on a React-based site, convert the resulting DOM into clean markdown, and finally extract structured JSON data from that content.
Prerequisites
To follow this tutorial, you will need:
- An API key from the Ilmenite dashboard.
- Python 3.8+ installed on your machine.
- The `requests` library (`pip install requests`).
- A target URL of a React or Next.js website.
Why it's hard to scrape a React website
When you visit a traditional website, the server sends a fully formed HTML document. When you point a library like `requests` or `curl` at a React site, you only receive the "shell" of the application.
If you inspect the source code of a React app, you will often see a body that looks like this:
```html
<body>
  <div id="root"></div>
  <script src="/static/js/main.chunk.js"></script>
</body>
```
The actual content—the product lists, user profiles, or articles—does not exist until the browser downloads the JavaScript files and executes them to populate the #root div.
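To tell whether a plain HTTP fetch returned real content or just an application shell, a quick heuristic check can help. The sketch below is illustrative only: the mount-point IDs and the 200-character visible-text threshold are arbitrary assumptions, not part of any API.

```python
import re

def looks_like_spa_shell(html):
    """Heuristically decide whether raw HTML is an unrendered SPA shell.

    A client-rendered React page typically ships a near-empty mount
    point (e.g. <div id="root">) plus script bundles, with very little
    visible text.
    """
    # Drop scripts and styles, then strip tags, to estimate visible text.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = " ".join(text.split())

    has_empty_mount = bool(re.search(r'<div id="(root|app|__next)">\s*</div>', html))
    return has_empty_mount or len(visible) < 200
```

If this returns `True` for a target page, plan on enabling JavaScript rendering for it.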
Historically, developers solved this by running headless Chrome via Puppeteer or Playwright. However, this creates a massive infrastructure burden. Each Chrome instance consumes 200-500MB of RAM and has a cold start time of 500-2,000ms.
Ilmenite solves this by using a browser engine built in pure Rust. It reduces RAM usage to ~2MB per session and achieves a cold start time of 0.19ms. For most sites, it uses its native Rust-based engine; for complex SPAs that require full V8 compatibility, it falls back to Chrome rendering.
How to scrape a React website using Ilmenite
Step 1: The Basic Scrape
First, let's see what happens when we make a standard request. We will use the /v1/scrape endpoint. By default, this endpoint is highly efficient, but for React sites, we need to explicitly request JavaScript rendering.
Here is a basic curl request:
```bash
curl -X POST https://api.ilmenite.dev/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-react-site.com",
    "format": "markdown"
  }'
```
If the site is a complex React app, the markdown returned here might only contain the header and footer, missing the main content.
Step 2: Enabling JavaScript Rendering
To capture the content generated by React, you must enable JavaScript rendering. In Ilmenite, this is controlled by the `render_js` parameter. This operation costs 3 credits per request, compared to 1 credit for a standard scrape.
When `render_js` is enabled, Ilmenite loads the page, executes the JavaScript bundles, waits for the DOM to stabilize, and then strips away the boilerplate to give you clean markdown.
Here is how to do this in Python:
```python
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://example-react-site.com"

payload = {
    "url": URL,
    "format": "markdown",
    "render_js": True
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

response = requests.post("https://api.ilmenite.dev/v1/scrape", json=payload, headers=headers)
print(response.json()['content'])
```
Step 3: Handling Next.js and Hydration
Next.js sites often use Server-Side Rendering (SSR) or Static Site Generation (SSG), meaning some HTML is present on load. However, "hydration" occurs when React takes over the static HTML to make it interactive.
If you are scraping a Next.js site to get data that updates dynamically (like a live price or a stock level), render_js: True is mandatory. This ensures you are seeing the state of the page after hydration is complete.
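One practical way to decide whether a given Next.js page actually needs the 3-credit render is to scrape it once without `render_js` and once with it, then compare the results. The helper below is a rough sketch, not an Ilmenite feature; the 1.5x length ratio is an arbitrary threshold you should tune for your targets.

```python
def needs_js_rendering(static_md, rendered_md, ratio=1.5):
    """Compare a plain scrape against a JS-rendered scrape of the same URL.

    If the rendered version is substantially longer, the page's main
    content is produced client-side and render_js should stay enabled.
    """
    if not static_md:
        # Nothing came back without rendering: the page is pure CSR.
        return True
    return len(rendered_md) / len(static_md) >= ratio
```

Run this comparison once per site (or per page template), cache the answer, and you avoid paying the rendering surcharge on pages that are fully server-rendered.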
Step 4: Extracting Structured Data
Once you can successfully render the React site, you likely want the data in a structured format rather than raw markdown. This is where the /v1/extract endpoint is useful.
Instead of writing complex CSS selectors that break whenever the React component tree changes, you can provide a JSON schema. Ilmenite uses an LLM to parse the rendered markdown and return only the data you need.
```python
extraction_payload = {
    "url": URL,
    "render_js": True,
    "schema": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "availability": {"type": "string"}
        },
        "required": ["product_name", "price"]
    }
}

response = requests.post("https://api.ilmenite.dev/v1/extract", json=extraction_payload, headers=headers)
print(response.json()['data'])
```
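Because the extraction is LLM-driven, it is worth sanity-checking the returned data against your schema before feeding it downstream. The minimal validator below is a hypothetical helper written for this guide; it handles only flat, top-level schemas like the one above, not the full JSON Schema specification.

```python
def validate_against_schema(data, schema):
    """Minimal check that extracted data matches a flat JSON schema.

    Covers only top-level "properties" types and "required" keys --
    enough to catch an extraction that dropped or mistyped a field.
    Returns a list of error strings (empty means the data passed).
    """
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = type_map.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            errors.append(f"{key}: expected {spec['type']}, got {type(data[key]).__name__}")
    return errors
```

A non-empty error list is a good signal to retry the extraction or flag the page for review.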
Full Working Example
Below is a complete script that checks if a page requires JS rendering and extracts the content.
```python
import requests
import json


class ReactScraper:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.ilmenite.dev/v1"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def scrape_page(self, url, use_js=True):
        endpoint = f"{self.base_url}/scrape"
        payload = {
            "url": url,
            "format": "markdown",
            "render_js": use_js
        }
        try:
            response = requests.post(endpoint, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json().get('content', '')
        except requests.exceptions.RequestException as e:
            print(f"Error scraping {url}: {e}")
            return None

    def extract_data(self, url, schema):
        endpoint = f"{self.base_url}/extract"
        payload = {
            "url": url,
            "render_js": True,
            "schema": schema
        }
        try:
            response = requests.post(endpoint, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json().get('data', {})
        except requests.exceptions.RequestException as e:
            print(f"Error extracting from {url}: {e}")
            return None


# Usage
if __name__ == "__main__":
    API_KEY = "YOUR_API_KEY"
    scraper = ReactScraper(API_KEY)
    target_url = "https://example-react-site.com/product/123"

    # 1. Get clean markdown of the rendered React page
    print("Fetching rendered markdown...")
    markdown_content = scraper.scrape_page(target_url)
    if markdown_content is not None:
        print(f"Content length: {len(markdown_content)} characters")

    # 2. Extract structured data
    print("\nExtracting structured data...")
    my_schema = {
        "type": "object",
        "properties": {
            "item_name": {"type": "string"},
            "price": {"type": "string"},
            "rating": {"type": "number"}
        }
    }
    data = scraper.extract_data(target_url, my_schema)
    print(json.dumps(data, indent=2))
```
Performance and Cost Considerations
When choosing a tool to scrape React websites, the infrastructure cost is the primary differentiator.
Most competitors wrap Chrome. If you run 1,000 concurrent sessions of Chrome, you need massive server clusters to handle the 200GB+ of RAM required. Ilmenite's Rust-based architecture changes this math. With ~2MB of RAM per session, a $5/month server can handle 1,000 concurrent sessions.
From a pricing perspective, Ilmenite uses a credit-based system:
- Standard Scrape: 1 credit.
- JS Rendering: 3 credits.
- LLM Extraction: 5 credits.
This means you only pay for the compute you actually use. You are not charged for "browser-hours" or the time a tab stays open, which is common in other cloud browser services.
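With per-request pricing, you can budget a batch job up front. The sketch below uses the credit costs listed above and assumes extraction credits are charged in addition to scraping credits (an assumption for illustration; check the pricing page for how credits actually combine).

```python
# Credit costs taken from the pricing list above.
CREDIT_COSTS = {"scrape": 1, "scrape_js": 3, "extract": 5}

def estimate_credits(n_pages, js_fraction, extract_fraction=0.0):
    """Estimate total credits for a batch job.

    js_fraction: share of pages that need JS rendering (3 credits
    instead of 1). extract_fraction: share that also go through LLM
    extraction (assumed here to cost 5 additional credits each).
    """
    js_pages = round(n_pages * js_fraction)
    plain_pages = n_pages - js_pages
    extract_pages = round(n_pages * extract_fraction)
    return (plain_pages * CREDIT_COSTS["scrape"]
            + js_pages * CREDIT_COSTS["scrape_js"]
            + extract_pages * CREDIT_COSTS["extract"])
```

For example, a 1,000-page crawl where half the pages need rendering would cost roughly 2,000 credits under these assumptions.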
Next Steps
Now that you can handle dynamic React and Next.js sites, you can integrate this data into a larger AI pipeline.
- Build a Knowledge Base: Use the crawl endpoint to discover all pages on a React site and index them into a vector database.
- Automate Data Extraction: Combine the `/v1/search` endpoint with `/v1/extract` to find and scrape React-based competitors automatically.
- Optimize Costs: Review our pricing page to see how to move from the Free tier to the Developer or Pro tiers for higher concurrency.
Ready to start? Sign up for a free account and try the playground to see how your target site renders in real-time.