# Crawling Documentation Sites: How to Crawl Docs for RAG
Building a RAG (Retrieval-Augmented Generation) system requires high-quality, up-to-date data. When your source is a third-party documentation site, you need a way to crawl docs for RAG without inheriting the noise of HTML, navigation bars, and cookie banners.
Ilmenite provides a streamlined path from a documentation URL to a vector database. By using a specialized web scraping API, you can bypass the infrastructure burden of managing headless browsers and focus on the embedding logic.
## The problem with documentation scraping
Most modern documentation sites are built using frameworks like Next.js, Docusaurus, or GitBook. These sites rely heavily on JavaScript to render content. A simple HTTP request often returns an empty shell, meaning you need a browser engine to execute the JavaScript before you can see the text.
Even after rendering, the resulting HTML is messy. Documentation pages are packed with sidebars, version switchers, search bars, and footers. If you feed this raw HTML into an LLM, you waste tokens on boilerplate and increase the likelihood of hallucinations.
The traditional solution is to run a fleet of headless Chrome instances using Puppeteer or Playwright. However, Chrome is resource-heavy. Each instance consumes 200-500MB of RAM and suffers from slow cold starts. For a large knowledge base with thousands of pages, this infrastructure becomes expensive and unstable.
## The architecture to crawl docs for RAG
To build an efficient knowledge base, you need a pipeline that discovers, extracts, and processes data. The most reliable flow involves three distinct phases: discovery, extraction, and indexing.
### 1. Discovery with `/v1/map`
Instead of blindly following every link on a page, which can lead to "crawler traps" or irrelevant sections (like blog archives), start with the `/v1/map` endpoint. This endpoint analyzes a domain and returns a structured list of all reachable URLs. This gives you a blueprint of the documentation site before you spend credits on full page scrapes.
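Once you have that blueprint, it is worth filtering it before spending any scrape credits. The sketch below is illustrative only: the example URLs and the `/docs/` prefix are assumptions, and a real map result may use a different path layout.

```python
from urllib.parse import urlparse

def filter_doc_urls(urls, prefix="/docs/"):
    """Keep only URLs whose path falls under the documentation prefix."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]

# Hypothetical map result containing docs pages mixed with blog and legal pages
mapped = [
    "https://docs.example.com/docs/getting-started",
    "https://docs.example.com/docs/api/scrape",
    "https://docs.example.com/blog/changelog-2024",
    "https://docs.example.com/terms",
]
doc_pages = filter_doc_urls(mapped)  # blog and /terms pages are dropped
```

Filtering by path prefix is a cheap first pass; for sites with less regular URL structures you may need an allowlist of sections instead.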
### 2. Extraction with `/v1/crawl`
Once you have the list of URLs, use the `/v1/crawl` endpoint, or call the scrape endpoint on each URL individually. Either way, Ilmenite handles the JavaScript rendering, strips away the noise, and converts the page into clean markdown. Markdown is the ideal format for RAG because it preserves structural cues (like headers and lists) that help LLMs understand the hierarchy of the information, while removing the HTML tags that confuse them.
### 3. Indexing and Chunking
The clean markdown is then split into smaller, overlapping chunks. These chunks are passed through an embedding model (such as OpenAI's text-embedding-3-small or Cohere) and stored in a vector database like Pinecone, Weaviate, or Qdrant. When a user asks a question, the system retrieves the most relevant markdown chunks to provide context to the LLM.
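To make the retrieval step concrete, here is a minimal sketch of embedding and nearest-neighbor lookup. The character-trigram "embedding" is a toy stand-in for a real model like text-embedding-3-small, and the brute-force cosine search stands in for a vector database; both are for illustration only.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: lowercase character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Markdown chunks as produced by the pipeline above (contents are hypothetical)
chunks = [
    "## Authentication\nPass your API key in the Authorization header.",
    "## Rate limits\nThe API allows 100 requests per minute.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, index, k=1):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

best = retrieve("How do I authenticate requests?", index)
```

With a real embedding model and vector database the shape of the code is the same: embed each chunk once at indexing time, embed the query at request time, and rank by similarity.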
## Implementation with Python
The following implementation uses the Ilmenite Python SDK to map a documentation site and extract the content for a knowledge base.
### Prerequisites
You will need an API key from the Ilmenite signup page, plus the ilmenite and langchain-text-splitters libraries:

```bash
pip install ilmenite langchain-text-splitters
```
### Full Implementation
```python
import os

from ilmenite import Ilmenite
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the client
client = Ilmenite(api_key=os.environ.get("ILMENITE_API_KEY"))

target_url = "https://docs.example.com"


def build_knowledge_base(url):
    print(f"Mapping site: {url}...")

    # Step 1: Map the domain to find all documentation pages.
    # This prevents crawling unnecessary pages like /terms or /privacy.
    map_result = client.map(url)
    urls = map_result.get("urls", [])
    print(f"Found {len(urls)} pages to process.")

    # Step 2: Scrape each identified page.
    # The scrape endpoint returns clean markdown by default.
    all_content = []
    for page_url in urls:
        print(f"Scraping {page_url}...")
        result = client.scrape(page_url, format="markdown")
        if result:
            all_content.append({
                "url": page_url,
                "text": result.get("markdown", "")
            })

    # Step 3: Chunk the markdown for the vector DB.
    # The separators prefer splitting on markdown headers first,
    # so chunks tend to align with sections of the documentation.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )

    final_chunks = []
    for page in all_content:
        chunks = text_splitter.split_text(page["text"])
        for chunk in chunks:
            final_chunks.append({
                "content": chunk,
                "metadata": {"source": page["url"]}
            })
    return final_chunks


# Execute the pipeline
knowledge_chunks = build_knowledge_base(target_url)
print(f"Generated {len(knowledge_chunks)} chunks for indexing.")
```
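Before pushing the chunks into a vector database, it can help to persist them as an intermediate artifact. One common choice is JSON Lines, one chunk per line, which ingestion scripts can stream. This sketch only assumes the `{"content", "metadata"}` shape produced above; the sample chunks are hypothetical.

```python
import json

def write_chunks_jsonl(chunks, path):
    """Write one JSON object per line so indexing can stream the file."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

# Sample chunks in the same shape as build_knowledge_base output
sample = [
    {"content": "## Install\npip install ilmenite",
     "metadata": {"source": "https://docs.example.com/install"}},
    {"content": "## Auth\nSet ILMENITE_API_KEY.",
     "metadata": {"source": "https://docs.example.com/auth"}},
]
write_chunks_jsonl(sample, "chunks.jsonl")
```

Keeping this file around also makes re-indexing cheap: you can re-embed from disk without re-scraping the site.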
## Results and performance
When crawling documentation at scale, the underlying architecture of the scraping API determines your cost and speed. Most competitors wrap Node.js and Headless Chrome. Ilmenite is built in pure Rust.
### Resource Efficiency
Because Ilmenite uses a custom browser engine in Rust, it avoids the memory overhead of Chrome. While a Chrome-based session requires 200-500MB of RAM, Ilmenite uses approximately 2MB per session. This allows for massive concurrency without crashing your infrastructure.
### Latency and Cold Starts
In a RAG pipeline, latency matters. If you are updating your knowledge base in real-time, you cannot afford multi-second cold starts. Ilmenite's cold start time is 0.19ms. This is 2,600x faster than Chrome-based alternatives, ensuring that your crawl jobs start executing immediately.
### Data Quality
The scrape endpoint doesn't just strip HTML; it intelligently identifies the main content area. By removing the navigation and footer, the resulting markdown contains only the technical documentation. This reduces the noise in your vector embeddings, leading to higher retrieval precision and fewer LLM hallucinations.
| Metric | Ilmenite | Chrome-based API |
|---|---|---|
| Cold Start | 0.19ms | 500-2,000ms |
| RAM per session | ~2MB | 200-500MB |
| Output Format | Clean Markdown | Raw HTML / Messy Text |
| Cost Model | Per credit | Per browser-hour |
## Going further
Once you have the basic crawling pipeline working, you can optimize your knowledge base for production.
### Handling PDF Documentation
Many technical products still provide documentation in PDF format. You can integrate the Ilmenite PDF extraction feature into your pipeline. This includes OCR for scanned documents, ensuring that no part of the manual is missing from your RAG system.
### Self-Hosting for Compliance
For enterprise teams with strict data residency or SOC 2 requirements, you can deploy Ilmenite as a single binary or a 12MB Docker image. This allows you to crawl documentation within your own air-gapped environment while maintaining the performance of the Rust engine.
### Automated Updates
Documentation changes frequently. Instead of re-crawling the entire site, use the `/v1/map` endpoint on a schedule to detect new URLs. You can then scrape only the modified pages and update the corresponding vectors in your database, keeping your AI agent's knowledge current without wasting credits.
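The incremental update described above reduces to a set difference between the freshly mapped URLs and the ones already indexed. A minimal sketch, assuming you keep the previously indexed URLs in a set (the example URLs are hypothetical):

```python
def diff_urls(mapped_urls, indexed_urls):
    """Split a fresh map result into newly added pages and removed pages."""
    mapped = set(mapped_urls)
    indexed = set(indexed_urls)
    return sorted(mapped - indexed), sorted(indexed - mapped)

# URLs already embedded in the vector database
indexed = {"https://docs.example.com/docs/a", "https://docs.example.com/docs/b"}
# URLs returned by the latest scheduled map run
mapped = ["https://docs.example.com/docs/a", "https://docs.example.com/docs/c"]

new_pages, removed_pages = diff_urls(mapped, indexed)
```

New pages go through the scrape-chunk-embed pipeline; removed pages should have their vectors deleted so the knowledge base does not serve stale answers. Detecting *modified* pages additionally requires comparing content hashes or last-modified metadata, which a URL diff alone cannot see.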
To start building your knowledge base, you can explore our pricing or try the API in the playground.