Web Scraping with Python — Using the Ilmenite SDK
AI agents need clean data. If you are building a Python application, a dedicated web scraping Python API is the most efficient way to turn raw URLs into LLM-ready markdown without managing headless browser infrastructure. In this guide, we will walk through using the Ilmenite Python SDK to scrape single pages, crawl entire sites, and handle asynchronous requests at scale.
What we're building
We are building a modular web data pipeline that can take a list of URLs and convert them into clean markdown. This pipeline will handle JavaScript-heavy pages (like those built with React or Next.js), extract structured data, and manage errors gracefully. By the end of this tutorial, you will have a production-ready script that feeds clean text into an AI agent or a RAG pipeline.
Prerequisites
Before starting, ensure you have the following:
- Python 3.8 or higher installed on your system.
- An API key from the Ilmenite dashboard.
- A basic understanding of Python's asyncio for high-performance requests.
- The ilmenite package installed via pip.
Getting Started with the Web Scraping Python API
Installation
First, install the SDK using pip. The SDK provides a thin wrapper around our Rust-based engine, letting you fetch pages without writing boilerplate requests or httpx code.
pip install ilmenite
Basic Scrape Implementation
The most common use case is converting a single URL to markdown. The /v1/scrape endpoint handles the rendering, cleaning, and conversion in one call. Because the engine is built in pure Rust, it features a 0.19ms cold start and uses only 2MB of RAM per session, ensuring minimal latency between your Python code and the data.
Here is the basic implementation:
from ilmenite import Ilmenite
# Initialize the client with your API key
client = Ilmenite(api_key="your_api_key_here")
# Scrape a page and get clean markdown
response = client.scrape("https://example.com")
print(response.markdown)
print(f"Page Title: {response.metadata.title}")
By default, Ilmenite strips away navigation bars, footers, and ads. This prevents your LLM from wasting tokens on boilerplate content and reduces noise in your embeddings.
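To see why stripping boilerplate matters for your token budget, you can compare raw HTML against cleaned markdown with a rough characters-per-token heuristic. The helper below and the 4-characters-per-token figure are illustrative assumptions, not part of the SDK:

```python
# Rough token estimate using the common ~4 characters-per-token heuristic.
# This is an illustrative approximation, not an Ilmenite SDK feature.
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    return max(1, len(text) // chars_per_token)

raw_html = "<nav>Home | About</nav><main>Hello world</main><footer>(c) 2024</footer>"
clean_markdown = "Hello world"

print(estimate_tokens(raw_html))       # tokens spent on the raw page
print(estimate_tokens(clean_markdown)) # tokens after boilerplate stripping
```

Run this against a real page and the gap is usually far larger, since navigation, scripts, and ads dominate raw HTML.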
Handling JavaScript-Heavy Websites
Many modern sites use React, Vue, or Angular. A standard HTTP request against these sites returns little more than an empty HTML shell, because the content is rendered client-side. While our native Rust engine handles most pages, complex single-page applications (SPAs) require a full browser environment.
You can trigger a Chrome render by passing the render_js parameter. This costs 3 credits per request compared to 1 credit for a standard scrape.
response = client.scrape(
    "https://react-app-example.com",
    render_js=True
)
print(response.markdown)
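Because JS rendering triples the cost of a request, it can help to estimate your spend before launching a large batch. Below is a plain-Python sketch using the credit prices quoted in this guide (1 for a standard scrape, 3 for a JS render, 5 for an LLM extraction); the helper function and job format are our own:

```python
# Credit costs as stated in this guide: 1 per standard scrape,
# 3 per JS-rendered scrape, 5 per LLM extraction.
CREDIT_COSTS = {"scrape": 1, "scrape_js": 3, "extract": 5}

def estimate_credits(jobs: list[tuple[str, int]]) -> int:
    """Sum the credit cost for a batch of (operation, count) pairs."""
    return sum(CREDIT_COSTS[op] * count for op, count in jobs)

# 100 standard pages, 20 SPA pages, 10 extractions
print(estimate_credits([("scrape", 100), ("scrape_js", 20), ("extract", 10)]))  # 210
```

A quick estimate like this tells you whether a crawl fits within your plan's monthly credits before you start it.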
Scaling your Web Scraping Python API with Asyncio
If you are scraping hundreds of pages for a RAG pipeline, synchronous requests will be your bottleneck. Python's asyncio allows you to fire multiple requests concurrently.
Ilmenite's API is designed for high concurrency. Because we don't wrap heavy Chrome instances for every request, we can handle thousands of concurrent sessions on minimal hardware.
Here is how to implement an asynchronous scraping loop:
import asyncio
from ilmenite import AsyncIlmenite
async def fetch_page(client, url):
    try:
        response = await client.scrape(url)
        return response.markdown
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

async def main():
    client = AsyncIlmenite(api_key="your_api_key_here")
    urls = [
        "https://docs.python.org/3/",
        "https://rust-lang.org",
        "https://ilmenite.dev/docs",
    ]
    # Create a list of tasks to run concurrently
    tasks = [fetch_page(client, url) for url in urls]
    results = await asyncio.gather(*tasks)
    for url, content in zip(urls, results):
        print(f"Scraped {url}: {len(content) if content else 0} characters")

if __name__ == "__main__":
    asyncio.run(main())
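One caveat: asyncio.gather fires every request at once, which can blow past your plan's concurrent request limit on large URL lists. A common pattern is to bound concurrency with asyncio.Semaphore. Here is a minimal, SDK-free sketch of the pattern, using a stand-in coroutine in place of client.scrape:

```python
import asyncio

MAX_CONCURRENCY = 2  # e.g. the Free-tier concurrent request limit

async def fake_scrape(url: str) -> str:
    """Stand-in for client.scrape so this pattern runs without the SDK."""
    await asyncio.sleep(0.01)
    return f"# Markdown for {url}"

async def bounded_fetch(semaphore: asyncio.Semaphore, url: str) -> str:
    # At most MAX_CONCURRENCY coroutines can hold the semaphore at once;
    # the rest wait here until a slot frees up.
    async with semaphore:
        return await fake_scrape(url)

async def main() -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    urls = [f"https://example.com/page/{i}" for i in range(6)]
    return await asyncio.gather(*(bounded_fetch(semaphore, u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 6
```

To adapt this for real scraping, replace fake_scrape with an awaited client.scrape call and set MAX_CONCURRENCY to your plan's limit.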
Advanced Data Extraction
Sometimes markdown is too verbose. If you need specific data—like a product price or a blog author—you can use the /v1/extract endpoint. This uses an LLM to map the page content to a JSON schema you provide. This operation costs 5 credits.
schema = {
    "product_name": "string",
    "price": "number",
    "currency": "string",
    "availability": "boolean"
}

extraction = client.extract(
    url="https://ecommerce-site.com/product/123",
    schema=schema
)
print(extraction.data)
# Output: {'product_name': 'Mechanical Keyboard', 'price': 129.99, 'currency': 'USD', 'availability': True}
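Because LLM-based extraction can occasionally return missing or mistyped fields, it is worth validating the response against your schema before ingesting it. Below is a small sketch of that check; the type mapping and validate function are our own convention, not part of the SDK:

```python
# Map the schema's type names to Python types. This mapping is our own
# convention for client-side validation, not part of the Ilmenite SDK.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def validate(data: dict, schema: dict) -> list[str]:
    """Return the names of fields that are missing or have the wrong type."""
    errors = []
    for field, type_name in schema.items():
        if field not in data:
            errors.append(field)
        elif not isinstance(data[field], TYPE_MAP[type_name]):
            errors.append(field)
    return errors

schema = {"product_name": "string", "price": "number", "availability": "boolean"}
good = {"product_name": "Mechanical Keyboard", "price": 129.99, "availability": True}
bad = {"product_name": "Mechanical Keyboard", "price": "129.99"}

print(validate(good, schema))  # []
print(validate(bad, schema))   # ['price', 'availability']
```

Rejecting or re-requesting records that fail validation keeps malformed values out of your downstream pipeline.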
Error Handling and Rate Limits
In a production environment, you must handle network failures and credit limits. Ilmenite returns standard HTTP status codes.
- 401 Unauthorized: Your API key is invalid.
- 429 Too Many Requests: You have exceeded your concurrent request limit (e.g., 2 for Free tier, 50 for Pro).
- 404 Not Found: The URL is unreachable.
Implement a basic retry mechanism with exponential backoff for 429 errors:
import time
from ilmenite.exceptions import RateLimitError
def safe_scrape(client, url, retries=3):
    for i in range(retries):
        try:
            return client.scrape(url)
        except RateLimitError:
            wait_time = 2 ** i
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
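One refinement: when many workers hit a 429 at the same moment, a fixed backoff schedule makes them all retry in lockstep and get rate-limited again together. Adding random jitter spreads retries out. Here is a sketch of just the wait-time calculation (the "full jitter" variant, which is one common choice among several):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter:
    a random wait in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(4):
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

To use it, replace the `wait_time = 2 ** i` line in safe_scrape with `wait_time = backoff_delay(i)`.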
Full Working Code Example
This script combines everything: asynchronous requests, JavaScript rendering for specific URLs, and error handling.
import asyncio
from ilmenite import AsyncIlmenite
from ilmenite.exceptions import RateLimitError
# Configuration
API_KEY = "your_api_key_here"
TARGET_URLS = [
    {"url": "https://ilmenite.dev", "js": False},
    {"url": "https://example-spa.com", "js": True},
    {"url": "https://invalid-url-test.com", "js": False},
]

async def process_url(client, item):
    url = item["url"]
    use_js = item["js"]
    try:
        print(f"Processing {url}...")
        response = await client.scrape(url, render_js=use_js)
        return {"url": url, "content": response.markdown, "status": "success"}
    except RateLimitError:
        print(f"Rate limit hit for {url}")
        return {"url": url, "content": None, "status": "rate_limited"}
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")
        return {"url": url, "content": None, "status": "error"}

async def main():
    client = AsyncIlmenite(api_key=API_KEY)
    tasks = [process_url(client, item) for item in TARGET_URLS]
    results = await asyncio.gather(*tasks)
    print("\n--- Final Results ---")
    for res in results:
        status = res["status"]
        url = res["url"]
        length = len(res["content"]) if res["content"] else 0
        print(f"URL: {url} | Status: {status} | Length: {length}")

if __name__ == "__main__":
    asyncio.run(main())
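If you are feeding these results into a RAG pipeline, a simple next step is persisting the successful scrapes as JSON Lines, one document per line. The file name and record shape below are illustrative, not prescribed by the SDK:

```python
import json

def write_jsonl(path: str, records: list[dict]) -> int:
    """Write one JSON object per successful scrape; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # Skip failed or empty scrapes so the corpus stays clean.
            if rec.get("status") == "success" and rec.get("content"):
                f.write(json.dumps({"url": rec["url"], "text": rec["content"]}) + "\n")
                written += 1
    return written

results = [
    {"url": "https://ilmenite.dev", "content": "# Docs", "status": "success"},
    {"url": "https://invalid-url-test.com", "content": None, "status": "error"},
]
print(write_jsonl("scraped.jsonl", results))  # 1
```

JSONL is convenient here because most embedding and ingestion tools can stream it line by line without loading the whole corpus into memory.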
Next Steps
Now that you have a working Python implementation, you can expand your data pipeline using other Ilmenite features:
- Crawl entire domains: Use the crawl endpoint to index all pages of a documentation site for your RAG system.
- Search the web: Integrate the search endpoint to let your AI agent find and scrape the top results for a query in one step.
- Optimize costs: Check the pricing page to move from the Free tier to the Developer or Pro tier for higher concurrency and lower per-credit costs.
Ready to start scraping? Sign up for a free account and get 500 credits to test your Python scripts.