Tutorial · April 9, 2026 · 4 min · Ilmenite Team

Build a RAG Pipeline with Web Data in 10 Minutes

Retrieval-Augmented Generation (RAG) lets AI agents access real-time, external data without constant model retraining. The biggest bottleneck in any RAG web-scraping workflow is the data ingestion phase: converting messy, JavaScript-heavy websites into a format that LLMs can actually process. In this tutorial, we will build an end-to-end pipeline that scrapes a URL with Ilmenite, chunks the content, generates embeddings via OpenAI, and stores them in Pinecone for efficient retrieval.

Prerequisites

Before starting, you will need the following:

  • Ilmenite API Key: Get one by signing up for a free account.
  • OpenAI API Key: For generating embeddings and the final LLM response.
  • Pinecone API Key: A free tier account for your vector database.
  • Python 3.9+: Installed on your local machine.
  • Required Libraries: Install the necessary packages via pip: pip install requests openai pinecone-client

Step 1: Scraping clean data for your RAG pipeline

The first step in any RAG pipeline built on web data is turning a URL into clean text. Raw HTML is filled with navigation menus, scripts, and CSS that waste tokens and confuse LLMs.

We use the Ilmenite scrape endpoint because it handles JavaScript rendering (React, Next.js, Vue) and returns clean markdown by default. Because Ilmenite is built in pure Rust and uses only 2MB of RAM per session, it is significantly more efficient than running a headless Chrome instance on your own server.

Here is how to fetch the markdown content of a page:

import requests

ILMENITE_API_KEY = "your_ilmenite_key"
URL_TO_SCRAPE = "https://example.com/blog-post"

def scrape_page(url):
    response = requests.post(
        "https://api.ilmenite.dev/v1/scrape",
        headers={"Authorization": f"Bearer {ILMENITE_API_KEY}"},
        json={"url": url},
        timeout=60  # don't let a slow page hang the pipeline indefinitely
    )

    if response.status_code == 200:
        # Ilmenite returns clean markdown by default
        return response.json().get("markdown")
    else:
        raise Exception(f"Scraping failed: {response.text}")

content = scrape_page(URL_TO_SCRAPE)
print(content[:500]) # Preview the first 500 characters

Step 2: Chunking the content

LLMs have finite context windows. If you feed an entire 5,000-word technical document into a prompt, you risk "lost in the middle" degradation or exceeding token limits. Chunking breaks the markdown into smaller, overlapping segments.

For this pipeline, we will use a simple recursive character splitter. We use an overlap (e.g., 200 characters) to ensure that semantic meaning isn't lost at the cut-off point of a chunk.

def chunk_text(text, chunk_size=1000, overlap=200):
    # Step forward by (chunk_size - overlap) so adjacent chunks share text;
    # overlap must be smaller than chunk_size or the loop never advances.
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

text_chunks = chunk_text(content)
print(f"Created {len(text_chunks)} chunks from the page.")
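A quick standalone check (re-stating the splitter so it runs on its own) confirms that consecutive chunks really do share exactly the overlap, so a sentence cut at a boundary still appears whole in at least one chunk:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2,500 characters of varying digits so overlapping regions are detectable
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)

# Each step advances 800 characters, so the last 200 characters of
# one chunk equal the first 200 characters of the next.
assert chunks[0][-200:] == chunks[1][:200]
print(f"{len(chunks)} chunks, overlap verified")
```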

Step 3: Generating embeddings with OpenAI

Once we have chunks, we need to convert them into vectors (mathematical representations of meaning). We will use OpenAI's text-embedding-3-small model, which provides a good balance between performance and cost.

from openai import OpenAI

client = OpenAI(api_key="your_openai_key")

def get_embedding(text):
    text = text.replace("\n", " ")
    return client.embeddings.create(
        input=[text], 
        model="text-embedding-3-small"
    ).data[0].embedding

# Generate embeddings for all chunks
embeddings = [get_embedding(chunk) for chunk in text_chunks]

Step 4: Storing vectors in Pinecone

To retrieve the most relevant data during a query, we store these vectors in Pinecone. When a user asks a question, we embed the question and perform a cosine similarity search to find the closest matching chunks.
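Pinecone performs this similarity search for us, but the metric itself is simple enough to see once in plain Python. A minimal sketch of cosine similarity between two vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal ones score 0.0
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))
```

Real embedding vectors have 1,536 dimensions rather than 2, but the arithmetic is identical.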

from pinecone import Pinecone

pc = Pinecone(api_key="your_pinecone_key")
index = pc.Index("rag-index") # Ensure you've created an index with 1536 dimensions

# Upsert chunks into the vector database
vectors = []
for i, (chunk, embedding) in enumerate(zip(text_chunks, embeddings)):
    vectors.append({
        "id": f"vec_{i}", 
        "values": embedding, 
        "metadata": {"text": chunk}
    })

index.upsert(vectors=vectors)
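A single page fits in one upsert call, but Pinecone enforces per-request size limits, so larger ingestion jobs typically send vectors in batches (around 100 per request is a common choice). A simple batching helper:

```python
def batched(items, batch_size=100):
    # Yield successive fixed-size slices of a list
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sizes = [len(b) for b in batched(list(range(250)))]
print(sizes)  # [100, 100, 50]
```

With the vectors list built above, the upsert becomes: for batch in batched(vectors): index.upsert(vectors=batch).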

Step 5: Retrieval and Generation

The final stage of the pipeline is the retrieval loop. We take a user query, find the relevant chunks from Pinecone, and pass them to GPT-4o as context.

def query_rag(user_query):
    # 1. Embed the query
    query_vec = get_embedding(user_query)
    
    # 2. Retrieve top 3 relevant chunks
    results = index.query(vector=query_vec, top_k=3, include_metadata=True)
    context = "\n\n".join([res['metadata']['text'] for res in results['matches']])
    
    # 3. Generate answer using context
    prompt = f"Use the following context to answer the question.\n\nContext:\n{context}\n\nQuestion: {user_query}"
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

answer = query_rag("What are the main takeaways from this page?")
print(answer)

Full Working Code Example

Here is the complete implementation combined into a single script.

import requests
from openai import OpenAI
from pinecone import Pinecone

# Configuration
ILMENITE_API_KEY = "your_ilmenite_key"
OPENAI_API_KEY = "your_openai_key"
PINECONE_API_KEY = "your_pinecone_key"
URL_TO_SCRAPE = "https://example.com/blog-post"

# Initialize Clients
openai_client = OpenAI(api_key=OPENAI_API_KEY)
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("rag-index")

def scrape_page(url):
    response = requests.post(
        "https://api.ilmenite.dev/v1/scrape",
        headers={"Authorization": f"Bearer {ILMENITE_API_KEY}"},
        json={"url": url},
        timeout=60
    )
    response.raise_for_status()  # fail loudly instead of chunking an error page
    return response.json().get("markdown")

def chunk_text(text, chunk_size=1000, overlap=200):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

def get_embedding(text):
    return openai_client.embeddings.create(
        input=[text.replace("\n", " ")], 
        model="text-embedding-3-small"
    ).data[0].embedding

def run_pipeline():
    # Ingestion
    print("Scraping page...")
    content = scrape_page(URL_TO_SCRAPE)
    
    print("Chunking and embedding...")
    chunks = chunk_text(content)
    vectors = []
    for i, chunk in enumerate(chunks):
        vectors.append({
            "id": f"vec_{i}", 
            "values": get_embedding(chunk), 
            "metadata": {"text": chunk}
        })
    
    print("Upserting to Pinecone...")
    index.upsert(vectors=vectors)
    
    # Retrieval
    query = "What is the main topic of this page?"
    query_vec = get_embedding(query)
    results = index.query(vector=query_vec, top_k=3, include_metadata=True)
    context = "\n\n".join([res['metadata']['text'] for res in results['matches']])
    
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    
    print("\nFinal Answer:\n", response.choices[0].message.content)

if __name__ == "__main__":
    run_pipeline()

Next Steps

This tutorial covers a single page, but most production AI agents need to index entire domains. To scale this pipeline, you can use the /v1/crawl endpoint to automatically discover and scrape all pages on a site.
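Once a crawl returns multiple pages, the ingestion loop above generalizes naturally. A sketch of multi-page indexing, assuming each crawled page arrives as a dict with "url" and "markdown" keys (an illustrative shape; see the documentation for the actual /v1/crawl response format):

```python
def index_pages(pages, chunk_fn, embed_fn):
    # Build Pinecone upsert records from multiple crawled pages.
    # Namespacing IDs by URL keeps chunks from different pages distinct
    # and lets the metadata point back to the source.
    vectors = []
    for page in pages:
        for i, chunk in enumerate(chunk_fn(page["markdown"])):
            vectors.append({
                "id": f"{page['url']}#chunk-{i}",
                "values": embed_fn(chunk),
                "metadata": {"text": chunk, "source": page["url"]},
            })
    return vectors
```

Swap in the chunk_text and get_embedding functions from the pipeline above for chunk_fn and embed_fn.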

If you are building a high-volume pipeline, check our pricing to see how our credit-based system compares to browser-hour metering. You can also explore our documentation to learn about structured extraction using JSON schemas, which allows you to scrape specific data points instead of full-page markdown.
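As an illustration of what structured extraction looks like, here is an example JSON Schema describing fields to pull from a page. The field names are invented for this example, and the documentation covers how to attach a schema to a scrape request:

```python
import json

# Illustrative schema: the scraper would return an object matching this
# shape instead of full-page markdown.
extraction_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "key_points"],
}

print(json.dumps(extraction_schema, indent=2))
```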

Ready to build? Start for free and get your API key today.