Web Scraping with Python — Using the Ilmenite SDK
AI agents need clean data. If you are building a Python application, a dedicated web scraping Python API is the most efficient way to turn raw URLs into LLM-ready markdown without managing headless browser infrastructure. In this guide, we will walk through using the Ilmenite Python SDK to scrape single pages, crawl entire sites, and handle asynchronous requests at scale.
What we're building
We are building a modular web data pipeline that can take a list of URLs and convert them into clean markdown. This pipeline will handle JavaScript-heavy pages (like those built with React or Next.js), extract structured data, and manage errors gracefully. By the end of this tutorial, you will have a production-ready script that feeds clean text into an AI agent or a RAG pipeline.
Prerequisites
Before starting, ensure you have the following:
- Python 3.8 or higher installed on your system.
- An API key from the Ilmenite dashboard.
- A basic understanding of Python's asyncio for high-performance requests.
- The ilmenite package installed via pip.
Getting Started with the Web Scraping Python API
Installation
First, install the SDK using pip. The SDK provides a thin wrapper around our Rust-based engine, letting you fetch pages without writing boilerplate requests or httpx code.
pip install ilmenite
Basic Scrape Implementation
The most common use case is converting a single URL to markdown. The /v1/scrape endpoint handles the rendering, cleaning, and conversion in one call. Because the engine is built in pure Rust, it features a 0.19ms cold start and uses only 2MB of RAM per session, ensuring minimal latency between your Python code and the data.
Here is the basic implementation:
from ilmenite import Ilmenite
# Initialize the client with your API key
client = Ilmenite(api_key="your_api_key_here")
# Scrape a page and get clean markdown
response = client.scrape("https://example.com")
print(response.markdown)
print(f"Page Title: {response.metadata.title}")
By default, Ilmenite strips away navigation bars, footers, and ads. This prevents your LLM from wasting tokens on boilerplate content and reduces noise in your embeddings.
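To see why stripping boilerplate matters for your token budget, you can compare raw HTML against cleaned markdown with a rough characters-per-token heuristic. The helper below and the 4-characters-per-token figure are illustrative assumptions, not part of the SDK:

```python
# Rough token estimate using the common ~4 characters-per-token heuristic.
# This is an illustrative approximation, not an Ilmenite SDK feature.
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    return max(1, len(text) // chars_per_token)

raw_html = "<nav>Home | About</nav><main>Hello world</main><footer>(c) 2024</footer>"
clean_markdown = "Hello world"

print(estimate_tokens(raw_html))       # tokens spent on the raw page
print(estimate_tokens(clean_markdown)) # tokens after boilerplate stripping
```

Run this against a real page and the gap is usually far larger, since navigation, scripts, and ads dominate raw HTML.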
Handling JavaScript-Heavy Websites
Many modern sites use React, Vue, or Angular. A standard HTTP request against these sites returns little more than an empty HTML shell, because the content is rendered client-side. While our native Rust engine handles most pages, complex single-page applications (SPAs) require a full browser environment.
You can trigger a Chrome render by passing the render_js parameter. This costs 3 credits per request compared to 1 credit for a standard scrape.
response = client.scrape(
    "https://react-app-example.com",
    render_js=True
)
print(response.markdown)
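Because JS rendering triples the cost of a request, it can help to estimate your spend before launching a large batch. Below is a plain-Python sketch using the credit prices quoted in this guide (1 for a standard scrape, 3 for a JS render, 5 for an LLM extraction); the helper function and job format are our own:

```python
# Credit costs as stated in this guide: 1 per standard scrape,
# 3 per JS-rendered scrape, 5 per LLM extraction.
CREDIT_COSTS = {"scrape": 1, "scrape_js": 3, "extract": 5}

def estimate_credits(jobs: list[tuple[str, int]]) -> int:
    """Sum the credit cost for a batch of (operation, count) pairs."""
    return sum(CREDIT_COSTS[op] * count for op, count in jobs)

# 100 standard pages, 20 SPA pages, 10 extractions
print(estimate_credits([("scrape", 100), ("scrape_js", 20), ("extract", 10)]))  # 210
```

A quick estimate like this tells you whether a crawl fits within your plan's monthly credits before you start it.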
Scaling your Web Scraping Python API with Asyncio
If you are scraping hundreds of pages for a RAG pipeline, synchronous requests will be your bottleneck. Python's asyncio allows you to fire multiple requests concurrently.
Ilmenite's API is designed for high concurrency. Because we don't wrap heavy Chrome instances for every request, we can handle thousands of concurrent sessions on minimal hardware.
Here is how to implement an asynchronous scraping loop:
import asyncio
from ilmenite import AsyncIlmenite
async def fetch_page(client, url):
    try:
        response = await client.scrape(url)
        return response.markdown
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

async def main():
    client = AsyncIlmenite(api_key="your_api_key_here")
    urls = [
        "https://docs.python.org/3/",
        "https://rust-lang.org",
        "https://ilmenite.dev/docs",
    ]
    # Create a list of tasks to run concurrently
    tasks = [fetch_page(client, url) for url in urls]
    results = await asyncio.gather(*tasks)
    for url, content in zip(urls, results):
        print(f"Scraped {url}: {len(content) if content else 0} characters")

if __name__ == "__main__":
    asyncio.run(main())
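One caveat: asyncio.gather fires every request at once, which can blow past your plan's concurrent request limit on large URL lists. A common pattern is to bound concurrency with asyncio.Semaphore. Here is a minimal, SDK-free sketch of the pattern, using a stand-in coroutine in place of client.scrape:

```python
import asyncio

MAX_CONCURRENCY = 2  # e.g. the Free-tier concurrent request limit

async def fake_scrape(url: str) -> str:
    """Stand-in for client.scrape so this pattern runs without the SDK."""
    await asyncio.sleep(0.01)
    return f"# Markdown for {url}"

async def bounded_fetch(semaphore: asyncio.Semaphore, url: str) -> str:
    # At most MAX_CONCURRENCY coroutines can hold the semaphore at once;
    # the rest wait here until a slot frees up.
    async with semaphore:
        return await fake_scrape(url)

async def main() -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    urls = [f"https://example.com/page/{i}" for i in range(6)]
    return await asyncio.gather(*(bounded_fetch(semaphore, u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 6
```

To adapt this for real scraping, replace fake_scrape with an awaited client.scrape call and set MAX_CONCURRENCY to your plan's limit.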
Advanced Data Extraction
Sometimes markdown is too verbose. If you need specific data—like a product price or a blog author—you can use the /v1/extract endpoint. This uses an LLM to map the page content to a JSON schema you provide. This operation costs 5 credits.
schema = {
    "product_name": "string",
    "price": "number",
    "currency": "string",
    "availability": "boolean"
}

extraction = client.extract(
    url="https://ecommerce-site.com/product/123",
    schema=schema
)
print(extraction.data)
# Output: {'product_name': 'Mechanical Keyboard', 'price': 129.99, 'currency': 'USD', 'availability': True}
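Because LLM-based extraction can occasionally return missing or mistyped fields, it is worth validating the response against your schema before ingesting it. Below is a small sketch of that check; the type mapping and validate function are our own convention, not part of the SDK:

```python
# Map the schema's type names to Python types. This mapping is our own
# convention for client-side validation, not part of the Ilmenite SDK.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def validate(data: dict, schema: dict) -> list[str]:
    """Return the names of fields that are missing or have the wrong type."""
    errors = []
    for field, type_name in schema.items():
        if field not in data:
            errors.append(field)
        elif not isinstance(data[field], TYPE_MAP[type_name]):
            errors.append(field)
    return errors

schema = {"product_name": "string", "price": "number", "availability": "boolean"}
good = {"product_name": "Mechanical Keyboard", "price": 129.99, "availability": True}
bad = {"product_name": "Mechanical Keyboard", "price": "129.99"}

print(validate(good, schema))  # []
print(validate(bad, schema))   # ['price', 'availability']
```

Rejecting or re-requesting records that fail validation keeps malformed values out of your downstream pipeline.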
Error Handling and Rate Limits
In a production environment, you must handle network failures and credit limits. Ilmenite returns standard HTTP status codes.
- 401 Unauthorized: Your API key is invalid.
- 429 Too Many Requests: You have exceeded your concurrent request limit (e.g., 2 for Free tier, 50 for Pro).
- 404 Not Found: The URL is unreachable.
Implement a basic retry mechanism with exponential backoff for 429 errors:
import time
from ilmenite.exceptions import RateLimitError
def safe_scrape(client, url, retries=3):
    for i in range(retries):
        try:
            return client.scrape(url)
        except RateLimitError:
            wait_time = 2 ** i
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
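One refinement: when many workers hit a 429 at the same moment, a fixed backoff schedule makes them all retry in lockstep and get rate-limited again together. Adding random jitter spreads retries out. Here is a sketch of just the wait-time calculation (the "full jitter" variant, which is one common choice among several):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter:
    a random wait in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(4):
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

To use it, replace the `wait_time = 2 ** i` line in safe_scrape with `wait_time = backoff_delay(i)`.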
Full Working Code Example
This script combines everything: asynchronous requests, JavaScript rendering for specific URLs, and error handling.
import asyncio
from ilmenite import AsyncIlmenite
from ilmenite.exceptions import RateLimitError
# Configuration
API_KEY = "your_api_key_here"
TARGET_URLS = [
    {"url": "https://ilmenite.dev", "js": False},
    {"url": "https://example-spa.com", "js": True},
    {"url": "https://invalid-url-test.com", "js": False},
]

async def process_url(client, item):
    url = item["url"]
    use_js = item["js"]
    try:
        print(f"Processing {url}...")
        response = await client.scrape(url, render_js=use_js)
        return {"url": url, "content": response.markdown, "status": "success"}
    except RateLimitError:
        print(f"Rate limit hit for {url}")
        return {"url": url, "content": None, "status": "rate_limited"}
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")
        return {"url": url, "content": None, "status": "error"}

async def main():
    client = AsyncIlmenite(api_key=API_KEY)
    tasks = [process_url(client, item) for item in TARGET_URLS]
    results = await asyncio.gather(*tasks)
    print("\n--- Final Results ---")
    for res in results:
        status = res["status"]
        url = res["url"]
        length = len(res["content"]) if res["content"] else 0
        print(f"URL: {url} | Status: {status} | Length: {length}")

if __name__ == "__main__":
    asyncio.run(main())
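If you are feeding these results into a RAG pipeline, a simple next step is persisting the successful scrapes as JSON Lines, one document per line. The file name and record shape below are illustrative, not prescribed by the SDK:

```python
import json

def write_jsonl(path: str, records: list[dict]) -> int:
    """Write one JSON object per successful scrape; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # Skip failed or empty scrapes so the corpus stays clean.
            if rec.get("status") == "success" and rec.get("content"):
                f.write(json.dumps({"url": rec["url"], "text": rec["content"]}) + "\n")
                written += 1
    return written

results = [
    {"url": "https://ilmenite.dev", "content": "# Docs", "status": "success"},
    {"url": "https://invalid-url-test.com", "content": None, "status": "error"},
]
print(write_jsonl("scraped.jsonl", results))  # 1
```

JSONL is convenient here because most embedding and ingestion tools can stream it line by line without loading the whole corpus into memory.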
Next Steps
Now that you have a working Python implementation, you can expand your data pipeline using other Ilmenite features:
- Crawl entire domains: Use the crawl endpoint to index all pages of a documentation site for your RAG system.
- Search the web: Integrate the search endpoint to let your AI agent find and scrape the top results for a query in one step.
- Optimize costs: Check the pricing page to move from the Free tier to the Developer or Pro tier for higher concurrency and lower per-credit costs.
Ready to start scraping? Sign up for a free account and get 500 credits to test your Python scripts.