LlamaIndex + Ilmenite — Loading Live Web Data
AI agents and RAG (Retrieval-Augmented Generation) pipelines are only as good as the data they can access. This guide shows you how to implement LlamaIndex web scraping using Ilmenite to turn any URL into clean, LLM-ready markdown that can be indexed and queried with natural language.
What we're building
We are building a RAG pipeline that takes a live URL, converts its content into clean markdown via the Ilmenite API, and loads that data into a LlamaIndex vector store. Once indexed, you can ask complex questions about the website's content, and the LLM will answer using the most relevant chunks of the scraped page. This removes the need to manually download HTML or manage headless browser infrastructure.
Prerequisites
To follow this tutorial, you will need:
- An Ilmenite API key (the free tier provides 500 credits/month).
- An OpenAI API key (or any other LLM provider supported by LlamaIndex).
- Python 3.9+ installed on your machine.
- The following Python packages: llama-index and requests.
You can install the dependencies via pip:
pip install llama-index requests
Why Ilmenite for LlamaIndex Web Scraping?
Most developers use Puppeteer or Playwright for web scraping, but running headless Chrome at scale is resource-intensive. Each Chrome instance consumes 200-500MB of RAM and suffers from slow cold starts.
Ilmenite is different. It is built in pure Rust, which allows it to start in 0.19ms and use only 2MB of RAM per session. For a RAG pipeline, this means your data ingestion is faster and your infrastructure costs are significantly lower.
Furthermore, LLMs struggle with raw HTML. HTML is filled with boilerplate, navigation menus, and script tags that waste tokens and confuse the model. Ilmenite's /v1/scrape endpoint strips this noise and returns clean markdown. Markdown preserves the structural hierarchy (headers, lists, links) that LlamaIndex needs for effective chunking and embedding, without the overhead of HTML tags.
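To see the difference concretely, compare a snippet of typical HTML to the markdown equivalent of the same content (both snippets here are illustrative, not actual Ilmenite output):

```python
# A typical fragment of rendered HTML: boilerplate nav plus the real content.
html = ('<nav class="main-nav"><ul><li><a href="/">Home</a></li></ul></nav>'
        '<article><h1 class="title">Pricing</h1>'
        '<p class="lead">Plans start at <strong>$9</strong>.</p></article>')

# The same content as markdown: the nav boilerplate is gone entirely,
# and the headers/emphasis survive as structure the chunker can use.
markdown = "# Pricing\n\nPlans start at **$9**."

ratio = len(markdown) / len(html)  # markdown is a fraction of the size
```

Fewer characters means fewer tokens per chunk, which directly lowers embedding and context costs.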
Implementing LlamaIndex Web Scraping Step-by-Step
Step 1: Scraping the page with Ilmenite
The first step is to fetch the content of a webpage. We use the /v1/scrape endpoint, which handles JavaScript rendering (React, Vue, Next.js) automatically.
In the example below, we request the output in markdown format. This is the default and most efficient format for RAG pipelines.
import requests

def fetch_web_content(url, api_key):
    endpoint = "https://api.ilmenite.dev/v1/scrape"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "format": "markdown"
    }
    response = requests.post(endpoint, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json().get("markdown")
    else:
        raise Exception(f"API Error: {response.status_code} - {response.text}")

# Example usage
ILMENITE_API_KEY = "your_ilmenite_key"
url = "https://example.com/blog-post"
content = fetch_web_content(url, ILMENITE_API_KEY)
print(content)
This single API call replaces an entire browser management stack. Because Ilmenite uses a 12MB Docker image and a Rust-based engine, the p95 API latency is just 47ms, ensuring your ingestion pipeline doesn't become a bottleneck.
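In production, ingestion should also tolerate transient failures (timeouts, rate limits). Here is a minimal retry sketch with exponential backoff; the `post_fn` callable and the retry parameters are illustrative helpers for this tutorial, not part of the Ilmenite API:

```python
import time

def scrape_with_retries(post_fn, payload, max_retries=3, base_delay=0.5):
    """Call post_fn(payload) until it succeeds, backing off exponentially.

    post_fn should return the parsed response on success and raise an
    exception on failure (network error or non-200 status).
    """
    for attempt in range(max_retries):
        try:
            return post_fn(payload)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Demonstration with a stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky_post(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return {"markdown": "# Hello"}

result = scrape_with_retries(flaky_post, {"url": "https://example.com"},
                             base_delay=0.05)
```

In real code, `post_fn` would wrap the `requests.post` call from `fetch_web_content` above.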
Step 2: Loading data into LlamaIndex
Once we have the markdown string, we need to wrap it in a LlamaIndex Document object. LlamaIndex uses these objects to manage the text before it is split into chunks and converted into embeddings.
from llama_index.core import Document

# Convert the markdown string into a LlamaIndex Document
document = Document(
    text=content,
    metadata={
        "source": url,
        "title": "Example Page"
    }
)
By adding the URL to the metadata, you ensure that the LLM can cite its sources when answering questions, which is critical for reducing hallucinations in production AI agents.
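Beyond the URL, it can help to stamp each document with when it was scraped and a content hash so you can skip re-indexing unchanged pages. A small sketch; the field names here are conventions of this tutorial, not anything LlamaIndex requires:

```python
import hashlib
from datetime import datetime, timezone

def build_metadata(url, content, title=None):
    """Assemble a metadata dict for a scraped page."""
    meta = {
        "source": url,
        # When the page was fetched, useful for freshness checks.
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        # A stable hash lets you detect and skip unchanged pages.
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }
    if title:
        meta["title"] = title
    return meta

meta = build_metadata("https://example.com/blog-post", "# Hello\nWorld")
```

The resulting dict can be passed directly as the `metadata` argument of `Document`.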
Step 3: Creating the Vector Index
Now we will take the document and create a VectorStoreIndex. LlamaIndex will automatically handle the chunking of the markdown text and store the embeddings in an in-memory vector store.
from llama_index.core import VectorStoreIndex
# Build the index from the document list
index = VectorStoreIndex.from_documents([document])
If you are building a larger system, you can replace the in-memory store with a production database like Pinecone, Weaviate, or Qdrant.
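To build intuition for why markdown structure helps chunking, here is a rough header-based splitter. This is only an illustration of the idea, not LlamaIndex's actual node parser, which is considerably more sophisticated:

```python
def split_markdown_by_headers(markdown_text):
    """Split a markdown string into sections at top-level and
    second-level headers. Purely illustrative of structural chunking."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # Start a new chunk whenever a header begins a new section.
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc_text = "# Intro\nSome text.\n## Details\nMore text.\n## FAQ\nQ and A."
chunks = split_markdown_by_headers(doc_text)
# Three chunks, one per section
```

Each chunk stays topically coherent because the headers mark semantic boundaries, which is exactly the structure raw HTML obscures.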
Step 4: Querying the live web data
The final step is to create a query engine. This allows you to ask questions in natural language. The engine will search the vector index for the most relevant markdown chunks and feed them to the LLM as context.
query_engine = index.as_query_engine()
response = query_engine.query("What are the main arguments presented in this article?")
print(response)
Scaling from one page to a whole site
The /v1/scrape endpoint is perfect for single pages. However, if you are building a comprehensive knowledge base, you will need to index entire domains.
For this, you should use the /v1/crawl endpoint. Instead of one URL, you provide a starting point and a depth limit. Ilmenite will discover all reachable pages, render them, and return the markdown for each.
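Assuming the crawl response carries one markdown payload per discovered page (the `url` and `markdown` keys below are assumptions, so verify them against the Ilmenite documentation), converting the results into LlamaIndex documents is just the single-page case in a loop:

```python
def pages_to_documents(pages):
    """Turn crawl results into (text, metadata) pairs ready to wrap in
    LlamaIndex Document objects. Each page dict is assumed to carry
    'url' and 'markdown' keys."""
    docs = []
    for page in pages:
        if not page.get("markdown"):
            continue  # skip pages that failed to render or were empty
        docs.append({
            "text": page["markdown"],
            "metadata": {"source": page["url"]},
        })
    return docs

# Example with a mock crawl result:
crawl_results = [
    {"url": "https://example.com/docs/a", "markdown": "# A"},
    {"url": "https://example.com/docs/b", "markdown": ""},
]
docs = pages_to_documents(crawl_results)
```

Each resulting entry maps directly to `Document(text=..., metadata=...)`, so the rest of the pipeline is unchanged.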
Credit Cost Comparison:
| Operation | Credits | Use Case |
|---|---|---|
| /v1/scrape | 1 | Single page analysis |
| /v1/crawl | 1 per page | Indexing a full documentation site |
| /v1/extract | 5 | Getting structured JSON from a page |
You can find more details on these operations in the Ilmenite documentation.
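Using the table above, you can estimate credit spend before kicking off a large job. A quick helper (the per-operation costs mirror the table; your plan's actual costs may differ):

```python
CREDIT_COSTS = {"scrape": 1, "extract": 5}  # credits per call
CRAWL_COST_PER_PAGE = 1                     # credits per crawled page

def estimate_credits(scrapes=0, extracts=0, crawled_pages=0):
    """Estimate total credits for a batch of operations."""
    return (scrapes * CREDIT_COSTS["scrape"]
            + extracts * CREDIT_COSTS["extract"]
            + crawled_pages * CRAWL_COST_PER_PAGE)

# A 200-page documentation crawl plus 10 structured extractions:
total = estimate_credits(crawled_pages=200, extracts=10)
# 200 + 50 = 250 credits, half of the free tier's monthly 500
```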
Full Working Code Example
Here is the complete implementation combining all the steps above.
import os
import requests
from llama_index.core import Document, VectorStoreIndex
# Configuration
ILMENITE_API_KEY = "your_ilmenite_key"
OPENAI_API_KEY = "your_openai_key"
TARGET_URL = "https://ilmenite.dev"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
def get_markdown_from_ilmenite(url):
    """Fetches clean markdown from a URL using Ilmenite API."""
    endpoint = "https://api.ilmenite.dev/v1/scrape"
    headers = {
        "Authorization": f"Bearer {ILMENITE_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "format": "markdown"
    }
    response = requests.post(endpoint, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json().get("markdown")
    else:
        print(f"Error fetching {url}: {response.text}")
        return None

def main():
    print(f"Scraping {TARGET_URL}...")
    markdown_content = get_markdown_from_ilmenite(TARGET_URL)
    if not markdown_content:
        print("Failed to retrieve content.")
        return

    # Create LlamaIndex Document
    doc = Document(
        text=markdown_content,
        metadata={"source": TARGET_URL}
    )

    # Index the document
    print("Indexing content...")
    index = VectorStoreIndex.from_documents([doc])

    # Query the index
    query_engine = index.as_query_engine()
    question = "What is Ilmenite and what are its performance benefits?"
    print(f"\nQuestion: {question}")
    response = query_engine.query(question)
    print(f"Answer: {response}")

if __name__ == "__main__":
    main()
Next Steps
Now that you have a basic RAG pipeline running with live web data, you can expand its capabilities:
- Implement Site-wide Indexing: Use the /v1/crawl endpoint to load an entire documentation site into LlamaIndex instead of a single page.
- Structured Data Extraction: If you need specific fields (like product prices or dates), use the /v1/extract endpoint to get JSON instead of markdown.
- Optimize Costs: Check the pricing page to see how to move from the Free tier to the Developer or Pro tiers for higher concurrency and lower credit costs.
- Explore the Playground: Test different URLs and see the markdown output in real-time using the Ilmenite playground.
Ready to build your AI agent? Sign up for Ilmenite and start scraping for free.