Build a RAG Pipeline with Web Data in 10 Minutes
Retrieval-Augmented Generation (RAG) allows AI agents to access real-time, external data without constant model retraining. The biggest bottleneck in any web-scraping RAG pipeline is the data ingestion phase: converting messy, JavaScript-heavy websites into a format that LLMs can actually process. In this tutorial, we will build an end-to-end pipeline that scrapes a URL with Ilmenite, chunks the content, generates embeddings via OpenAI, and stores them in Pinecone for efficient retrieval.
Prerequisites
Before starting, you will need the following:
- Ilmenite API Key: Get one by signing up for a free account.
- OpenAI API Key: For generating embeddings and the final LLM response.
- Pinecone API Key: A free tier account for your vector database.
- Python 3.9+: Installed on your local machine.
- Required Libraries: Install the necessary packages via pip:
pip install requests openai pinecone-client
Step 1: Scraping clean data for your RAG pipeline
The first step in building the pipeline is turning a URL into clean text. Raw HTML is filled with navigation menus, scripts, and CSS that waste tokens and confuse LLMs.
We use the Ilmenite scrape endpoint because it handles JavaScript rendering (React, Next.js, Vue) and returns clean markdown by default. Because Ilmenite is built in pure Rust and uses only 2MB of RAM per session, it is significantly more efficient than running a headless Chrome instance on your own server.
Here is how to fetch the markdown content of a page:
import requests

ILMENITE_API_KEY = "your_ilmenite_key"
URL_TO_SCRAPE = "https://example.com/blog-post"

def scrape_page(url):
    response = requests.post(
        "https://api.ilmenite.dev/v1/scrape",
        headers={"Authorization": f"Bearer {ILMENITE_API_KEY}"},
        json={"url": url},
    )
    if response.status_code == 200:
        # Ilmenite returns clean markdown by default
        return response.json().get("markdown")
    raise Exception(f"Scraping failed: {response.text}")

content = scrape_page(URL_TO_SCRAPE)
print(content[:500])  # Preview the first 500 characters
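Live scrapes can fail transiently (timeouts, rate limits), so it is worth wrapping the request in a retry. The helper below is a minimal sketch of exponential backoff; the name `with_retries` and its parameters are our own, not part of any client library.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

You would then call `with_retries(lambda: scrape_page(URL_TO_SCRAPE))` instead of calling `scrape_page` directly.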
Step 2: Chunking the content
LLMs have finite context windows. If you feed an entire 5,000-word technical document into a prompt, you risk "lost in the middle" degradation or exceeding token limits. Chunking breaks the markdown into smaller, overlapping segments.
For this pipeline, we will use a simple recursive character splitter. We use an overlap (e.g., 200 characters) to ensure that semantic meaning isn't lost at the cut-off point of a chunk.
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

text_chunks = chunk_text(content)
print(f"Created {len(text_chunks)} chunks from the page.")
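The fixed-size splitter above can cut a chunk mid-sentence. Since Ilmenite returns markdown, one improvement is to split on blank lines first and pack whole paragraphs into each chunk. This is a sketch of that idea; the function name `chunk_markdown` is our own.

```python
def chunk_markdown(text, chunk_size=1000):
    """Greedily pack whole paragraphs into chunks of roughly chunk_size chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)  # chunk is full; start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Very long paragraphs can still exceed `chunk_size` here; in practice you would fall back to the character splitter for any oversized paragraph.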
Step 3: Generating embeddings with OpenAI
Once we have chunks, we need to convert them into vectors (mathematical representations of meaning). We will use OpenAI's text-embedding-3-small model, which provides a good balance between performance and cost.
from openai import OpenAI

client = OpenAI(api_key="your_openai_key")

def get_embedding(text):
    text = text.replace("\n", " ")
    return client.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    ).data[0].embedding

# Generate embeddings for all chunks
embeddings = [get_embedding(chunk) for chunk in text_chunks]
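Looping one API call per chunk works, but the OpenAI embeddings endpoint accepts a list of inputs, so you can embed many chunks per request. Below is a sketch of that batching approach; `embed_all` reuses the `client` created above, and the batch size of 100 is an arbitrary choice, not an API limit we are asserting.

```python
def batched(items, batch_size=100):
    """Yield successive batch_size-sized slices of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, batch_size=100):
    """Embed chunks with one API call per batch instead of per chunk."""
    embeddings = []
    for batch in batched(chunks, batch_size):
        resp = client.embeddings.create(
            input=[c.replace("\n", " ") for c in batch],
            model="text-embedding-3-small",
        )
        embeddings.extend(d.embedding for d in resp.data)
    return embeddings
```

For a single blog post the difference is small, but once you crawl whole sites, batching cuts both latency and request overhead significantly.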
Step 4: Storing vectors in Pinecone
To retrieve the most relevant data during a query, we store these vectors in Pinecone. When a user asks a question, we embed the question and perform a cosine similarity search to find the closest matching chunks.
from pinecone import Pinecone

pc = Pinecone(api_key="your_pinecone_key")
index = pc.Index("rag-index")  # Ensure you've created an index with 1536 dimensions

# Upsert chunks into the vector database
vectors = []
for i, (chunk, embedding) in enumerate(zip(text_chunks, embeddings)):
    vectors.append({
        "id": f"vec_{i}",
        "values": embedding,
        "metadata": {"text": chunk}
    })
index.upsert(vectors=vectors)
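One caveat with sequential IDs like `vec_0`: if you ingest a second page (or re-scrape the same one), new vectors overwrite old ones that happen to share an index position. A deterministic ID derived from the source URL and chunk text makes upserts idempotent instead. This is a sketch; the `chunk_id` helper is our own.

```python
import hashlib

def chunk_id(source_url, chunk):
    """Deterministic vector ID: re-ingesting identical content overwrites
    the same record instead of creating duplicates or clobbering others."""
    digest = hashlib.sha256(f"{source_url}|{chunk}".encode("utf-8")).hexdigest()
    return f"vec_{digest[:16]}"
```

Each upserted record would then use `"id": chunk_id(URL_TO_SCRAPE, chunk)`, and it is worth adding the source URL to the metadata as well so answers can cite where a chunk came from.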
Step 5: Retrieval and Generation
The final stage of the pipeline is the retrieval loop. We take a user query, find the relevant chunks in Pinecone, and pass them to GPT-4o as context.
def query_rag(user_query):
    # 1. Embed the query
    query_vec = get_embedding(user_query)

    # 2. Retrieve the top 3 most relevant chunks
    results = index.query(vector=query_vec, top_k=3, include_metadata=True)
    context = "\n\n".join(res["metadata"]["text"] for res in results["matches"])

    # 3. Generate an answer using the retrieved context
    prompt = f"Use the following context to answer the question.\n\nContext:\n{context}\n\nQuestion: {user_query}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

answer = query_rag("What are the main takeaways from this page?")
print(answer)
Full Working Code Example
Here is the complete implementation combined into a single script.
import requests
from openai import OpenAI
from pinecone import Pinecone

# Configuration
ILMENITE_API_KEY = "your_ilmenite_key"
OPENAI_API_KEY = "your_openai_key"
PINECONE_API_KEY = "your_pinecone_key"
URL_TO_SCRAPE = "https://example.com/blog-post"

# Initialize clients
openai_client = OpenAI(api_key=OPENAI_API_KEY)
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("rag-index")

def scrape_page(url):
    response = requests.post(
        "https://api.ilmenite.dev/v1/scrape",
        headers={"Authorization": f"Bearer {ILMENITE_API_KEY}"},
        json={"url": url},
    )
    response.raise_for_status()  # fail loudly instead of embedding None
    return response.json().get("markdown")

def chunk_text(text, chunk_size=1000, overlap=200):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

def get_embedding(text):
    return openai_client.embeddings.create(
        input=[text.replace("\n", " ")],
        model="text-embedding-3-small"
    ).data[0].embedding

def run_pipeline():
    # Ingestion
    print("Scraping page...")
    content = scrape_page(URL_TO_SCRAPE)

    print("Chunking and embedding...")
    chunks = chunk_text(content)
    vectors = []
    for i, chunk in enumerate(chunks):
        vectors.append({
            "id": f"vec_{i}",
            "values": get_embedding(chunk),
            "metadata": {"text": chunk}
        })

    print("Upserting to Pinecone...")
    index.upsert(vectors=vectors)

    # Retrieval
    query = "What is the main topic of this page?"
    query_vec = get_embedding(query)
    results = index.query(vector=query_vec, top_k=3, include_metadata=True)
    context = "\n\n".join(res["metadata"]["text"] for res in results["matches"])

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    print("\nFinal Answer:\n", response.choices[0].message.content)

if __name__ == "__main__":
    run_pipeline()
Next Steps
This tutorial covers a single page, but most production AI agents need to index entire domains. To scale this pipeline, you can use the /v1/crawl endpoint to automatically discover and scrape all pages on a site.
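To give a feel for what a crawl call might look like, here is a sketch against the `/v1/crawl` endpoint mentioned above. Note the request payload shape (the `url` and `limit` fields) is an assumption for illustration; check the Ilmenite documentation for the actual schema.

```python
import requests

def build_crawl_request(start_url, limit=50):
    """Build a crawl request payload.
    NOTE: the field names here are assumptions, not the documented schema."""
    return {"url": start_url, "limit": limit}

def crawl_site(start_url, api_key, limit=50):
    """Kick off a site-wide crawl (hypothetical payload shape)."""
    response = requests.post(
        "https://api.ilmenite.dev/v1/crawl",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_crawl_request(start_url, limit),
    )
    response.raise_for_status()
    return response.json()
```

Each page returned by the crawl would then flow through the same chunk-embed-upsert steps from Steps 2-4.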
If you are building a high-volume pipeline, check our pricing to see how our credit-based system compares to browser-hour metering. You can also explore our documentation to learn about structured extraction using JSON schemas, which allows you to scrape specific data points instead of full-page markdown.
Ready to build? Start for free and get your API key today.