Scraping React and Next.js Sites — The Complete Guide
AI agents need data from modern web apps, but if you try to scrape a React website using standard HTTP libraries, you will likely receive a nearly empty HTML document. This happens because React and Next.js rely on Client-Side Rendering (CSR), where the content is generated by JavaScript in the browser after the initial page load. To get the actual data, you need a tool that can execute that JavaScript and return the rendered state.
What we're building
In this guide, we will build a data extraction pipeline that can bypass the "empty page" problem common in Single Page Applications (SPAs). We will use the Ilmenite API to render JavaScript on a React-based site, convert the resulting DOM into clean markdown, and finally extract structured JSON data from that content.
Prerequisites
To follow this tutorial, you will need:
- An API key from the Ilmenite dashboard.
- Python 3.8+ installed on your machine.
- The `requests` library (`pip install requests`).
- A target URL of a React or Next.js website.
Why it's hard to scrape a React website
When you visit a traditional website, the server sends a fully formed HTML document. When you point a library like `requests` or `curl` at a React site, you only receive the "shell" of the application.
If you inspect the source code of a React app, you will often see a body that looks like this:
```html
<body>
  <div id="root"></div>
  <script src="/static/js/main.chunk.js"></script>
</body>
```
The actual content—the product lists, user profiles, or articles—does not exist until the browser downloads the JavaScript files and executes them to populate the #root div.
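To tell whether a plain HTTP fetch returned real content or just an application shell, a quick heuristic check can help. The sketch below is illustrative only: the mount-point IDs and the 200-character visible-text threshold are arbitrary assumptions, not part of any API.

```python
import re

def looks_like_spa_shell(html):
    """Heuristically decide whether raw HTML is an unrendered SPA shell.

    A client-rendered React page typically ships a near-empty mount
    point (e.g. <div id="root">) plus script bundles, with very little
    visible text.
    """
    # Drop scripts and styles, then strip tags, to estimate visible text.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = " ".join(text.split())

    has_empty_mount = bool(re.search(r'<div id="(root|app|__next)">\s*</div>', html))
    return has_empty_mount or len(visible) < 200
```

If this returns `True` for a target page, plan on enabling JavaScript rendering for it.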
Historically, developers solved this by running headless Chrome via Puppeteer or Playwright. However, this creates a massive infrastructure burden. Each Chrome instance consumes 200-500MB of RAM and has a cold start time of 500-2,000ms.
Ilmenite solves this by using a browser engine built in pure Rust. It reduces RAM usage to ~2MB per session and achieves a cold start time of 0.19ms. For most sites, it uses its native Rust-based engine; for complex SPAs that require full V8 compatibility, it falls back to Chrome rendering.
How to scrape a React website using Ilmenite
Step 1: The Basic Scrape
First, let's see what happens when we make a standard request. We will use the /v1/scrape endpoint. By default, this endpoint is highly efficient, but for React sites, we need to explicitly request JavaScript rendering.
Here is a basic curl request:
```bash
curl -X POST https://api.ilmenite.dev/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-react-site.com",
    "format": "markdown"
  }'
```
If the site is a complex React app, the markdown returned here might only contain the header and footer, missing the main content.
Step 2: Enabling JavaScript Rendering
To capture the content generated by React, you must enable JavaScript rendering. In Ilmenite, this is controlled by the `render_js` parameter. This operation costs 3 credits per request, compared to 1 credit for a standard scrape.
When `render_js` is enabled, Ilmenite loads the page, executes the JavaScript bundles, waits for the DOM to stabilize, and then strips away the boilerplate to give you clean markdown.
Here is how to do this in Python:
```python
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://example-react-site.com"

payload = {
    "url": URL,
    "format": "markdown",
    "render_js": True
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

response = requests.post("https://api.ilmenite.dev/v1/scrape", json=payload, headers=headers)
print(response.json()['content'])
```
Step 3: Handling Next.js and Hydration
Next.js sites often use Server-Side Rendering (SSR) or Static Site Generation (SSG), meaning some HTML is present on load. However, "hydration" occurs when React takes over the static HTML to make it interactive.
If you are scraping a Next.js site to get data that updates dynamically (like a live price or a stock level), render_js: True is mandatory. This ensures you are seeing the state of the page after hydration is complete.
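One practical way to decide whether a given Next.js page actually needs the 3-credit render is to scrape it once without `render_js` and once with it, then compare the results. The helper below is a rough sketch, not an Ilmenite feature; the 1.5x length ratio is an arbitrary threshold you should tune for your targets.

```python
def needs_js_rendering(static_md, rendered_md, ratio=1.5):
    """Compare a plain scrape against a JS-rendered scrape of the same URL.

    If the rendered version is substantially longer, the page's main
    content is produced client-side and render_js should stay enabled.
    """
    if not static_md:
        # Nothing came back without rendering: the page is pure CSR.
        return True
    return len(rendered_md) / len(static_md) >= ratio
```

Run this comparison once per site (or per page template), cache the answer, and you avoid paying the rendering surcharge on pages that are fully server-rendered.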
Step 4: Extracting Structured Data
Once you can successfully render the React site, you likely want the data in a structured format rather than raw markdown. This is where the /v1/extract endpoint is useful.
Instead of writing complex CSS selectors that break whenever the React component tree changes, you can provide a JSON schema. Ilmenite uses an LLM to parse the rendered markdown and return only the data you need.
```python
extraction_payload = {
    "url": URL,
    "render_js": True,
    "schema": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "availability": {"type": "string"}
        },
        "required": ["product_name", "price"]
    }
}

response = requests.post("https://api.ilmenite.dev/v1/extract", json=extraction_payload, headers=headers)
print(response.json()['data'])
```
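Because the extraction is LLM-driven, it is worth sanity-checking the returned data against your schema before feeding it downstream. The minimal validator below is a hypothetical helper written for this guide; it handles only flat, top-level schemas like the one above, not the full JSON Schema specification.

```python
def validate_against_schema(data, schema):
    """Minimal check that extracted data matches a flat JSON schema.

    Covers only top-level "properties" types and "required" keys --
    enough to catch an extraction that dropped or mistyped a field.
    Returns a list of error strings (empty means the data passed).
    """
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = type_map.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            errors.append(f"{key}: expected {spec['type']}, got {type(data[key]).__name__}")
    return errors
```

A non-empty error list is a good signal to retry the extraction or flag the page for review.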
Full Working Example
Below is a complete script that checks if a page requires JS rendering and extracts the content.
```python
import requests
import json


class ReactScraper:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.ilmenite.dev/v1"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def scrape_page(self, url, use_js=True):
        endpoint = f"{self.base_url}/scrape"
        payload = {
            "url": url,
            "format": "markdown",
            "render_js": use_js
        }
        try:
            response = requests.post(endpoint, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json().get('content', '')
        except requests.exceptions.RequestException as e:
            print(f"Error scraping {url}: {e}")
            return None

    def extract_data(self, url, schema):
        endpoint = f"{self.base_url}/extract"
        payload = {
            "url": url,
            "render_js": True,
            "schema": schema
        }
        try:
            response = requests.post(endpoint, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json().get('data', {})
        except requests.exceptions.RequestException as e:
            print(f"Error extracting from {url}: {e}")
            return None


# Usage
if __name__ == "__main__":
    API_KEY = "YOUR_API_KEY"
    scraper = ReactScraper(API_KEY)
    target_url = "https://example-react-site.com/product/123"

    # 1. Get clean markdown of the rendered React page
    print("Fetching rendered markdown...")
    markdown_content = scraper.scrape_page(target_url)
    if markdown_content is not None:
        print(f"Content length: {len(markdown_content)} characters")

    # 2. Extract structured data
    print("\nExtracting structured data...")
    my_schema = {
        "type": "object",
        "properties": {
            "item_name": {"type": "string"},
            "price": {"type": "string"},
            "rating": {"type": "number"}
        }
    }
    data = scraper.extract_data(target_url, my_schema)
    print(json.dumps(data, indent=2))
```
Performance and Cost Considerations
When choosing a tool to scrape React websites, the infrastructure cost is the primary differentiator.
Most competitors wrap Chrome. If you run 1,000 concurrent sessions of Chrome, you need massive server clusters to handle the 200GB+ of RAM required. Ilmenite's Rust-based architecture changes this math. With ~2MB of RAM per session, a $5/month server can handle 1,000 concurrent sessions.
From a pricing perspective, Ilmenite uses a credit-based system:
- Standard Scrape: 1 credit.
- JS Rendering: 3 credits.
- LLM Extraction: 5 credits.
This means you only pay for the compute you actually use. You are not charged for "browser-hours" or the time a tab stays open, which is common in other cloud browser services.
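With per-request pricing, you can budget a batch job up front. The sketch below uses the credit costs listed above and assumes extraction credits are charged in addition to scraping credits (an assumption for illustration; check the pricing page for how credits actually combine).

```python
# Credit costs taken from the pricing list above.
CREDIT_COSTS = {"scrape": 1, "scrape_js": 3, "extract": 5}

def estimate_credits(n_pages, js_fraction, extract_fraction=0.0):
    """Estimate total credits for a batch job.

    js_fraction: share of pages that need JS rendering (3 credits
    instead of 1). extract_fraction: share that also go through LLM
    extraction (assumed here to cost 5 additional credits each).
    """
    js_pages = round(n_pages * js_fraction)
    plain_pages = n_pages - js_pages
    extract_pages = round(n_pages * extract_fraction)
    return (plain_pages * CREDIT_COSTS["scrape"]
            + js_pages * CREDIT_COSTS["scrape_js"]
            + extract_pages * CREDIT_COSTS["extract"])
```

For example, a 1,000-page crawl where half the pages need rendering would cost roughly 2,000 credits under these assumptions.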
Next Steps
Now that you can handle dynamic React and Next.js sites, you can integrate this data into a larger AI pipeline.
- Build a Knowledge Base: Use the crawl endpoint to discover all pages on a React site and index them into a vector database.
- Automate Data Extraction: Combine the `/v1/search` endpoint with `/v1/extract` to find and scrape React-based competitors automatically.
- Optimize Costs: Review our pricing page to see how to move from the Free tier to the Developer or Pro tiers for higher concurrency.
Ready to start? Sign up for a free account and try the playground to see how your target site renders in real-time.