# Crawling Documentation Sites: How to Crawl Docs for RAG
Building a RAG (Retrieval-Augmented Generation) system requires high-quality, up-to-date data. When your source is a third-party documentation site, you need a way to crawl docs for RAG without inheriting the noise of HTML, navigation bars, and cookie banners.
Ilmenite provides a streamlined path from a documentation URL to a vector database. By using a specialized web scraping API, you can bypass the infrastructure burden of managing headless browsers and focus on the embedding logic.
## The problem with documentation scraping
Most modern documentation sites are built using frameworks like Next.js, Docusaurus, or GitBook. These sites rely heavily on JavaScript to render content. A simple HTTP request often returns an empty shell, meaning you need a browser engine to execute the JavaScript before you can see the text.
Even after rendering, the resulting HTML is messy. Documentation pages are packed with sidebars, version switchers, search bars, and footers. If you feed this raw HTML into an LLM, you waste tokens on boilerplate and increase the likelihood of hallucinations.
The traditional solution is to run a fleet of headless Chrome instances using Puppeteer or Playwright. However, Chrome is resource-heavy. Each instance consumes 200-500MB of RAM and suffers from slow cold starts. For a large knowledge base with thousands of pages, this infrastructure becomes expensive and unstable.
## The architecture to crawl docs for RAG
To build an efficient knowledge base, you need a pipeline that discovers, extracts, and processes data. The most reliable flow involves three distinct phases: discovery, extraction, and indexing.
### 1. Discovery with `/v1/map`
Instead of blindly following every link on a page, which can lead to "crawler traps" or irrelevant sections (like blog archives), start with the `/v1/map` endpoint. This endpoint analyzes a domain and returns a structured list of all reachable URLs. This gives you a blueprint of the documentation site before you spend credits on full page scrapes.
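Once you have that blueprint, it is worth filtering it before spending any scrape credits. The sketch below is illustrative only: the example URLs and the `/docs/` prefix are assumptions, and a real map result may use a different path layout.

```python
from urllib.parse import urlparse

def filter_doc_urls(urls, prefix="/docs/"):
    """Keep only URLs whose path falls under the documentation prefix."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]

# Hypothetical map result containing docs pages mixed with blog and legal pages
mapped = [
    "https://docs.example.com/docs/getting-started",
    "https://docs.example.com/docs/api/scrape",
    "https://docs.example.com/blog/changelog-2024",
    "https://docs.example.com/terms",
]
doc_pages = filter_doc_urls(mapped)  # blog and /terms pages are dropped
```

Filtering by path prefix is a cheap first pass; for sites with less regular URL structures you may need an allowlist of sections instead.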
### 2. Extraction with `/v1/crawl`
Once you have the list of URLs, use the `/v1/crawl` endpoint, or call the scrape endpoint on each URL individually. Either way, Ilmenite handles the JavaScript rendering, strips away the noise, and converts the page into clean markdown. Markdown is the ideal format for RAG because it preserves structural cues (like headers and lists) that help LLMs understand the hierarchy of the information, while removing the HTML tags that confuse them.
### 3. Indexing and Chunking
The clean markdown is then split into smaller, overlapping chunks. These chunks are passed through an embedding model (such as OpenAI's text-embedding-3-small or Cohere) and stored in a vector database like Pinecone, Weaviate, or Qdrant. When a user asks a question, the system retrieves the most relevant markdown chunks to provide context to the LLM.
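To make the retrieval step concrete, here is a minimal sketch of embedding and nearest-neighbor lookup. The character-trigram "embedding" is a toy stand-in for a real model like text-embedding-3-small, and the brute-force cosine search stands in for a vector database; both are for illustration only.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: lowercase character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Markdown chunks as produced by the pipeline above (contents are hypothetical)
chunks = [
    "## Authentication\nPass your API key in the Authorization header.",
    "## Rate limits\nThe API allows 100 requests per minute.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, index, k=1):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

best = retrieve("How do I authenticate requests?", index)
```

With a real embedding model and vector database the shape of the code is the same: embed each chunk once at indexing time, embed the query at request time, and rank by similarity.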
## Implementation with Python
The following implementation uses the Ilmenite Python SDK to map a documentation site and extract the content for a knowledge base.
### Prerequisites
You will need an API key from the Ilmenite signup page, plus the ilmenite and langchain-text-splitters libraries:

```bash
pip install ilmenite langchain-text-splitters
```
### Full Implementation
```python
import os

from ilmenite import Ilmenite
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the client
client = Ilmenite(api_key=os.environ.get("ILMENITE_API_KEY"))

target_url = "https://docs.example.com"


def build_knowledge_base(url):
    print(f"Mapping site: {url}...")

    # Step 1: Map the domain to find all documentation pages.
    # This prevents crawling unnecessary pages like /terms or /privacy.
    map_result = client.map(url)
    urls = map_result.get("urls", [])
    print(f"Found {len(urls)} pages to process.")

    # Step 2: Scrape each identified page.
    # The scrape endpoint returns clean markdown by default.
    all_content = []
    for page_url in urls:
        print(f"Scraping {page_url}...")
        result = client.scrape(page_url, format="markdown")
        if result:
            all_content.append({
                "url": page_url,
                "text": result.get("markdown", "")
            })

    # Step 3: Chunk the markdown for the vector DB.
    # The separators prefer splitting on markdown headers first,
    # so chunks tend to align with sections of the documentation.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )

    final_chunks = []
    for page in all_content:
        chunks = text_splitter.split_text(page["text"])
        for chunk in chunks:
            final_chunks.append({
                "content": chunk,
                "metadata": {"source": page["url"]}
            })
    return final_chunks


# Execute the pipeline
knowledge_chunks = build_knowledge_base(target_url)
print(f"Generated {len(knowledge_chunks)} chunks for indexing.")
```
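Before pushing the chunks into a vector database, it can help to persist them as an intermediate artifact. One common choice is JSON Lines, one chunk per line, which ingestion scripts can stream. This sketch only assumes the `{"content", "metadata"}` shape produced above; the sample chunks are hypothetical.

```python
import json

def write_chunks_jsonl(chunks, path):
    """Write one JSON object per line so indexing can stream the file."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

# Sample chunks in the same shape as build_knowledge_base output
sample = [
    {"content": "## Install\npip install ilmenite",
     "metadata": {"source": "https://docs.example.com/install"}},
    {"content": "## Auth\nSet ILMENITE_API_KEY.",
     "metadata": {"source": "https://docs.example.com/auth"}},
]
write_chunks_jsonl(sample, "chunks.jsonl")
```

Keeping this file around also makes re-indexing cheap: you can re-embed from disk without re-scraping the site.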
## Results and performance
When crawling documentation at scale, the underlying architecture of the scraping API determines your cost and speed. Most competitors wrap Node.js and Headless Chrome. Ilmenite is built in pure Rust.
### Resource Efficiency
Because Ilmenite uses a custom browser engine in Rust, it avoids the memory overhead of Chrome. While a Chrome-based session requires 200-500MB of RAM, Ilmenite uses approximately 2MB per session. This allows for massive concurrency without crashing your infrastructure.
### Latency and Cold Starts
In a RAG pipeline, latency matters. If you are updating your knowledge base in real-time, you cannot afford multi-second cold starts. Ilmenite's cold start time is 0.19ms. This is 2,600x faster than Chrome-based alternatives, ensuring that your crawl jobs start executing immediately.
### Data Quality
The scrape endpoint doesn't just strip HTML; it intelligently identifies the main content area. By removing the navigation and footer, the resulting markdown contains only the technical documentation. This reduces the noise in your vector embeddings, leading to higher retrieval precision and fewer LLM hallucinations.
| Metric | Ilmenite | Chrome-based API |
|---|---|---|
| Cold Start | 0.19ms | 500-2,000ms |
| RAM per session | ~2MB | 200-500MB |
| Output Format | Clean Markdown | Raw HTML / Messy Text |
| Cost Model | Per credit | Per browser-hour |
## Going further
Once you have the basic crawling pipeline working, you can optimize your knowledge base for production.
### Handling PDF Documentation
Many technical products still provide documentation in PDF format. You can integrate the Ilmenite PDF extraction feature into your pipeline. This includes OCR for scanned documents, ensuring that no part of the manual is missing from your RAG system.
### Self-Hosting for Compliance
For enterprise teams with strict data residency or SOC 2 requirements, you can deploy Ilmenite as a single binary or a 12MB Docker image. This allows you to crawl documentation within your own air-gapped environment while maintaining the performance of the Rust engine.
### Automated Updates
Documentation changes frequently. Instead of re-crawling the entire site, use the `/v1/map` endpoint on a schedule to detect new URLs. You can then scrape only the modified pages and update the corresponding vectors in your database, keeping your AI agent's knowledge current without wasting credits.
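The incremental update described above reduces to a set difference between the freshly mapped URLs and the ones already indexed. A minimal sketch, assuming you keep the previously indexed URLs in a set (the example URLs are hypothetical):

```python
def diff_urls(mapped_urls, indexed_urls):
    """Split a fresh map result into newly added pages and removed pages."""
    mapped = set(mapped_urls)
    indexed = set(indexed_urls)
    return sorted(mapped - indexed), sorted(indexed - mapped)

# URLs already embedded in the vector database
indexed = {"https://docs.example.com/docs/a", "https://docs.example.com/docs/b"}
# URLs returned by the latest scheduled map run
mapped = ["https://docs.example.com/docs/a", "https://docs.example.com/docs/c"]

new_pages, removed_pages = diff_urls(mapped, indexed)
```

New pages go through the scrape-chunk-embed pipeline; removed pages should have their vectors deleted so the knowledge base does not serve stale answers. Detecting *modified* pages additionally requires comparing content hashes or last-modified metadata, which a URL diff alone cannot see.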
To start building your knowledge base, you can explore our pricing or try the API in the playground.