Use Case · April 9, 2026 · 5 min · Ilmenite Team

Crawling Documentation Sites: How to Crawl Docs for RAG

Building a RAG (Retrieval-Augmented Generation) system requires high-quality, up-to-date data. When your source is a third-party documentation site, you need a way to crawl docs for RAG without inheriting the noise of HTML, navigation bars, and cookie banners.

Ilmenite provides a streamlined path from a documentation URL to a vector database. By using a specialized web scraping API, you can bypass the infrastructure burden of managing headless browsers and focus on the embedding logic.

The problem with documentation scraping

Most modern documentation sites are built using frameworks like Next.js, Docusaurus, or GitBook. These sites rely heavily on JavaScript to render content. A simple HTTP request often returns an empty shell, meaning you need a browser engine to execute the JavaScript before you can see the text.

Even after rendering, the resulting HTML is messy. Documentation pages are packed with sidebars, version switchers, search bars, and footers. If you feed this raw HTML into an LLM, you waste tokens on boilerplate and increase the likelihood of hallucinations.

The traditional solution is to run a fleet of headless Chrome instances using Puppeteer or Playwright. However, Chrome is resource-heavy. Each instance consumes 200-500MB of RAM and suffers from slow cold starts. For a large knowledge base with thousands of pages, this infrastructure becomes expensive and unstable.

The architecture to crawl docs for RAG

To build an efficient knowledge base, you need a pipeline that discovers, extracts, and processes data. The most reliable flow involves three distinct phases: discovery, extraction, and indexing.

1. Discovery with /v1/map

Instead of blindly following every link on a page, which can lead to "crawler traps" or irrelevant sections (like blog archives), start with the /v1/map endpoint. This endpoint analyzes a domain and returns a structured list of all reachable URLs. This gives you a blueprint of the documentation site before you spend credits on full page scrapes.
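As a concrete illustration, the mapped URL list can be filtered before any scraping happens. The sketch below is not part of the Ilmenite SDK; it assumes the map result is a plain list of URL strings and that the documentation lives under a hypothetical /docs path prefix.

```python
from urllib.parse import urlparse

def filter_doc_urls(urls, allowed_prefix="/docs"):
    # Keep only URLs under the documentation path, dropping
    # blog archives, legal pages, and other crawler traps.
    return [u for u in urls if urlparse(u).path.startswith(allowed_prefix)]

# Hypothetical output from the /v1/map endpoint
mapped = [
    "https://docs.example.com/docs/getting-started",
    "https://docs.example.com/docs/api/authentication",
    "https://docs.example.com/blog/archive/2025",
    "https://docs.example.com/terms",
]

doc_urls = filter_doc_urls(mapped)
```

Filtering at this stage means every credit you spend on a full scrape goes toward content that will actually end up in the knowledge base.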

2. Extraction with /v1/crawl

Once you have the list of URLs, use the /v1/crawl endpoint. This handles the JavaScript rendering and strips away the noise. Ilmenite converts the page into clean markdown. Markdown is the ideal format for RAG because it preserves structural cues (like headers and lists) that help LLMs understand the hierarchy of the information, while removing the HTML tags that confuse them.
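To see why the heading structure matters, here is a small, SDK-independent sketch that tags each paragraph of scraped markdown with the section heading it falls under, which is exactly the kind of metadata you can later attach to chunks.

```python
def tag_sections(markdown_text):
    # Attach the nearest H2 heading to every non-heading line,
    # preserving the hierarchy cue that raw HTML buries in markup.
    tagged = []
    current_heading = "(intro)"
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            current_heading = stripped[3:]
        elif stripped and not stripped.startswith("#"):
            tagged.append((current_heading, stripped))
    return tagged

sample = """# API Guide

## Authentication
Use a bearer token in the header.

## Rate Limits
The default limit is 100 requests per minute.
"""

pairs = tag_sections(sample)
```

A chunk tagged with its parent heading retrieves better than the same text alone, because the heading often carries the keywords a user actually queries with.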

3. Indexing and Chunking

The clean markdown is then split into smaller, overlapping chunks. These chunks are passed through an embedding model (such as OpenAI's text-embedding-3-small or Cohere) and stored in a vector database like Pinecone, Weaviate, or Qdrant. When a user asks a question, the system retrieves the most relevant markdown chunks to provide context to the LLM.
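The retrieval step itself reduces to a nearest-neighbor search over the stored vectors. A vector database handles this at scale, but the core operation is cosine similarity; the toy three-dimensional vectors below stand in for real embeddings from a model such as text-embedding-3-small.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def retrieve(query_vec, index, top_k=2):
    # index: list of (chunk_text, embedding) pairs, as stored in the vector DB.
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

index = [
    ("How to authenticate requests", [0.9, 0.1, 0.0]),
    ("Configuring rate limits", [0.1, 0.9, 0.1]),
    ("Billing and invoices", [0.0, 0.2, 0.9]),
]

# A query vector close to the authentication chunk
results = retrieve([0.8, 0.2, 0.1], index, top_k=1)
```

The retrieved chunks, plus their source URLs from the metadata, become the context window passed to the LLM.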

Implementation with Python

The following implementation uses the Ilmenite Python SDK to map a documentation site and extract the content for a knowledge base.

Prerequisites

You will need an API key from the Ilmenite signup page and the ilmenite and langchain-text-splitters libraries installed.

pip install ilmenite langchain-text-splitters

Full Implementation

import os
from ilmenite import Ilmenite
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the client
client = Ilmenite(api_key=os.environ.get("ILMENITE_API_KEY"))

target_url = "https://docs.example.com"

def build_knowledge_base(url):
    print(f"Mapping site: {url}...")
    
    # Step 1: Map the domain to find all documentation pages
    # This prevents crawling unnecessary pages like /terms or /privacy
    map_result = client.map(url)
    urls = map_result.get("urls", [])
    print(f"Found {len(urls)} pages to process.")

    # Step 2: Scrape each identified page
    # Scraping page by page gives us clean markdown per URL
    all_content = []
    for page_url in urls:
        print(f"Scraping {page_url}...")
        # The scrape endpoint returns clean markdown by default
        result = client.scrape(page_url, format="markdown")
        
        if result:
            all_content.append({
                "url": page_url,
                "text": result.get("markdown", "")
            })

    # Step 3: Chunk the markdown for the vector DB
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )

    final_chunks = []
    for page in all_content:
        chunks = text_splitter.split_text(page["text"])
        for chunk in chunks:
            final_chunks.append({
                "content": chunk,
                "metadata": {"source": page["url"]}
            })

    return final_chunks

# Execute the pipeline
knowledge_chunks = build_knowledge_base(target_url)
print(f"Generated {len(knowledge_chunks)} chunks for indexing.")

Results and performance

When crawling documentation at scale, the underlying architecture of the scraping API determines your cost and speed. Most competitors wrap Node.js and Headless Chrome. Ilmenite is built in pure Rust.

Resource Efficiency

Because Ilmenite uses a custom browser engine in Rust, it avoids the memory overhead of Chrome. While a Chrome-based session requires 200-500MB of RAM, Ilmenite uses approximately 2MB per session. This allows for massive concurrency without crashing your infrastructure.

Latency and Cold Starts

In a RAG pipeline, latency matters. If you are updating your knowledge base in real-time, you cannot afford multi-second cold starts. Ilmenite's cold start time is 0.19ms. This is 2,600x faster than Chrome-based alternatives, ensuring that your crawl jobs start executing immediately.

Data Quality

The scrape endpoint doesn't just strip HTML; it intelligently identifies the main content area. By removing the navigation and footer, the resulting markdown contains only the technical documentation. This reduces the noise in your vector embeddings, leading to higher retrieval precision and fewer LLM hallucinations.

Metric          | Ilmenite       | Chrome-based API
Cold Start      | 0.19ms         | 500-2,000ms
RAM per session | ~2MB           | 200-500MB
Output Format   | Clean Markdown | Raw HTML / Messy Text
Cost Model      | Per credit     | Per browser-hour

Going further

Once you have the basic crawling pipeline working, you can optimize your knowledge base for production.

Handling PDF Documentation

Many technical products still provide documentation in PDF format. You can integrate the Ilmenite PDF extraction feature into your pipeline. This includes OCR for scanned documents, ensuring that no part of the manual is missing from your RAG system.

Self-Hosting for Compliance

For enterprise teams with strict data residency or SOC 2 requirements, you can deploy Ilmenite as a single binary or a 12MB Docker image. This allows you to crawl documentation within your own air-gapped environment while maintaining the performance of the Rust engine.

Automated Updates

Documentation changes frequently. Instead of re-crawling the entire site, use the /v1/map endpoint on a schedule to detect new URLs. You can then scrape only the modified pages and update the corresponding vectors in your database, keeping your AI agent's knowledge current without wasting credits.
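A minimal sketch of that incremental check, assuming you persist the previous map result as a plain list of URLs (the scheduling and re-scraping logic are left out):

```python
def diff_url_sets(previous_urls, current_urls):
    # Compare the last stored map snapshot against a fresh /v1/map result.
    prev, curr = set(previous_urls), set(current_urls)
    return {
        "added": sorted(curr - prev),    # pages to scrape and embed
        "removed": sorted(prev - curr),  # vectors to delete from the DB
    }

previous = ["https://docs.example.com/docs/a", "https://docs.example.com/docs/b"]
current = ["https://docs.example.com/docs/b", "https://docs.example.com/docs/c"]

changes = diff_url_sets(previous, current)
```

Pages present in both snapshots may still have changed content; if staleness matters, you can additionally re-scrape unchanged URLs on a slower cadence and compare content hashes before re-embedding.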

To start building your knowledge base, you can explore our pricing or try the API in the playground.