Tutorial · April 9, 2026 · 5 min · Ilmenite Team

LangChain + Ilmenite — Web Scraping for AI Chains

Meta description: Learn how to implement langchain web scraping using Ilmenite's API. Build an AI agent that converts URLs to clean markdown for RAG pipelines and autonomous tools.

AI agents require clean, structured data to function. If you feed raw HTML into a Large Language Model (LLM), you waste tokens on boilerplate and risk hallucination. By integrating langchain web scraping with Ilmenite, you can convert any URL into LLM-ready markdown in a single API call.

In this tutorial, we will build an autonomous agent that can browse the web, extract relevant information from a page, and answer a user's question based on that live data.

What we're building

We are building a LangChain agent equipped with a custom web-browsing tool. This agent will use the scrape endpoint to fetch content from the web. Instead of managing a heavy headless browser infrastructure, the agent will call Ilmenite's Rust-based engine to receive clean markdown, which it then uses to synthesize an answer.

Prerequisites

Before starting, ensure you have the following:

  • An Ilmenite API key (get one at ilmenite.dev/signup)
  • An OpenAI API key (or any LLM provider supported by LangChain)
  • Python 3.10 or higher
  • The following Python packages: pip install langchain langchain-openai langchainhub requests

Step 1: Implementing LangChain Web Scraping with a DocumentLoader

LangChain uses DocumentLoaders to bring data from various sources into a pipeline. Since Ilmenite returns clean markdown, it is an ideal source for RAG (Retrieval-Augmented Generation) pipelines.

Standard HTML loaders often leave behind scripts, styles, and navigation menus. Ilmenite strips this noise server-side. Because the engine is built in pure Rust, it has a 0.19ms cold start and uses only 2MB of RAM per session, making it significantly more efficient than running a local Chrome instance.
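To see why stripping markup matters for token budgets, here is a toy illustration using only Python's standard library. This is not how Ilmenite's engine works internally; it simply shows how much of a typical page's byte count is navigation, scripts, and styling rather than content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

raw_html = (
    "<html><head><style>nav{color:red}</style></head>"
    "<body><nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<script>trackPageView();</script>"
    "<main><h1>Cold starts</h1><p>The engine starts in under a millisecond.</p></main>"
    "</body></html>"
)

parser = TextExtractor()
parser.feed(raw_html)
clean = " ".join(parser.parts)

# The raw markup is several times larger than the recoverable text
print(len(raw_html), len(clean))
print(clean)
```

Every one of those extra markup bytes becomes tokens your LLM pays for without gaining information, which is the overhead a markdown-first API avoids.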

Here is how to create a custom IlmeniteLoader:

import requests
from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader

class IlmeniteLoader(BaseLoader):
    """Loads a single URL as clean markdown via Ilmenite's scrape endpoint."""

    def __init__(self, url: str, api_key: str):
        self.url = url
        self.api_key = api_key

    def load(self) -> list[Document]:
        endpoint = "https://api.ilmenite.dev/v1/scrape"
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "url": self.url,
            "format": "markdown"
        }

        response = requests.post(endpoint, json=payload, headers=headers, timeout=30)
        response.raise_for_status()

        data = response.json()
        # Ilmenite returns the clean markdown in the 'content' field
        content = data.get("content", "")

        return [Document(page_content=content, metadata={"source": self.url})]

Step 2: Creating a Tool for Autonomous Agent Web Scraping

While a DocumentLoader is great for indexing, an autonomous agent needs a Tool it can call dynamically. A tool allows the agent to decide when it needs to visit a website to find an answer.

We will wrap the Ilmenite API into a LangChain Tool object. This tool will take a URL as input and return the cleaned markdown content.

from langchain.tools import Tool

def scrape_website(url: str) -> str:
    """Fetches clean markdown for a URL via Ilmenite's scrape endpoint."""
    api_key = "YOUR_ILMENITE_API_KEY"
    endpoint = "https://api.ilmenite.dev/v1/scrape"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"url": url, "format": "markdown"}

    response = requests.post(endpoint, json=payload, headers=headers, timeout=30)
    if response.status_code == 200:
        return response.json().get("content", "No content found.")
    return f"Error scraping page: {response.status_code}"

# Define the tool for the agent
web_scraping_tool = Tool(
    name="web_scraper",
    func=scrape_website,
    description="Useful for when you need to read the content of a specific website URL. Input should be a full URL."
)
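At scale, any network-backed tool will occasionally hit timeouts or rate limits. A small generic retry helper (this is plain Python, not part of Ilmenite's SDK) can wrap the scraping function before it is handed to the agent:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Wraps a callable so transient failures are retried with
    exponential backoff before giving up."""
    def wrapped(*args, **kwargs):
        last_error = None
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                last_error = exc
                time.sleep(base_delay * (2 ** attempt))
        # Return an error string so the agent can recover gracefully
        return f"Failed after {attempts} attempts: {last_error}"
    return wrapped

# Demo with a flaky function that succeeds on its third call
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return f"content of {url}"

robust = with_retries(flaky, attempts=3, base_delay=0.01)
print(robust("https://example.com"))  # content of https://example.com
```

You could then pass `func=with_retries(scrape_website)` when constructing the Tool, so one flaky request does not derail an entire agent run.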

Step 3: Building the Full Agent Chain

Now we connect the tool to an LLM. We will use the OpenAI functions agent, which is highly reliable for tool calling. The agent will analyze the user's prompt, decide if it needs to use the web_scraper tool, and then process the markdown returned by Ilmenite.

from langchain_openai import ChatOpenAI
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langchain import hub

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Get a standard prompt template for tool-using agents
prompt = hub.pull("hwchase17/openai-functions-agent")

# Create the agent
agent = create_openai_functions_agent(llm, [web_scraping_tool], prompt)

# Create the executor
agent_executor = AgentExecutor(agent=agent, tools=[web_scraping_tool], verbose=True)

Full Working Code Example

Below is the complete implementation. Replace YOUR_ILMENITE_API_KEY and YOUR_OPENAI_API_KEY with your actual credentials.

import os
import requests
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langchain import hub

# Configuration
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
ILMENITE_API_KEY = "YOUR_ILMENITE_API_KEY"

def scrape_website(url: str) -> str:
    """Fetches clean markdown from a URL using Ilmenite."""
    endpoint = "https://api.ilmenite.dev/v1/scrape"
    headers = {
        "Authorization": f"Bearer {ILMENITE_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "format": "markdown"
    }
    
    try:
        response = requests.post(endpoint, json=payload, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json().get("content", "No content found.")
    except Exception as e:
        return f"Failed to scrape {url}: {str(e)}"

# 1. Define the tool
web_scraping_tool = Tool(
    name="web_scraper",
    func=scrape_website,
    description="Use this tool to get the content of a web page. Input must be a URL."
)

# 2. Initialize LLM and Prompt
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = hub.pull("hwchase17/openai-functions-agent")

# 3. Construct the Agent
agent = create_openai_functions_agent(llm, [web_scraping_tool], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[web_scraping_tool], verbose=True)

# 4. Run the agent
query = "Go to https://ilmenite.dev and tell me what the cold start time of the browser engine is."
response = agent_executor.invoke({"input": query})

print("\nFinal Answer:\n", response["output"])
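The hardcoded key strings above are for readability only; in a real deployment you should load credentials from the environment and fail fast when one is missing. A minimal helper (generic Python, not an Ilmenite API) looks like this:

```python
import os

def require_env(name: str) -> str:
    """Returns an environment variable's value, failing loudly at
    startup if it is missing rather than mid-agent-run."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In the agent script, replace the hardcoded constants with:
#   ILMENITE_API_KEY = require_env("ILMENITE_API_KEY")
# and export OPENAI_API_KEY in your shell instead of assigning it in code.
```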

Why this approach works

Most langchain web scraping implementations rely on Playwright or Puppeteer. These libraries require you to manage a headless Chrome instance, which consumes 200–500 MB of RAM per session and is prone to crashing from memory leaks.

Ilmenite removes this infrastructure burden. Because it is delivered as a managed API, your agent stays lightweight. You pay per scrape rather than paying for browser-hours or maintaining a server to run Chrome.

If you are scaling this to thousands of requests, the cost difference is significant. You can view our full pricing to compare the credit-based model against traditional browser-as-a-service providers.

Next steps

Now that you have a basic agent, you can expand its capabilities:

  1. Full Site Indexing: Use the /v1/crawl endpoint to index entire documentation sites into a vector database for a more robust RAG pipeline.
  2. Structured Data: Use the /v1/extract endpoint to return JSON instead of markdown when your agent needs to fill a database.
  3. Search Integration: Integrate the /v1/search endpoint so your agent can find URLs on its own before scraping them.
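As a starting point for the crawl endpoint, here is a sketch that builds the request without sending it. The endpoint path comes from the list above, but the `max_pages` field and any payload keys beyond `url` and `format` are illustrative assumptions; check the Ilmenite documentation for the exact schema:

```python
import json

def build_crawl_request(api_key: str, url: str, max_pages: int = 50):
    """Builds the endpoint, headers, and JSON payload for a crawl job.
    NOTE: 'max_pages' is an assumed parameter name, shown here only to
    illustrate the request shape."""
    endpoint = "https://api.ilmenite.dev/v1/crawl"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"url": url, "format": "markdown", "max_pages": max_pages}
    return endpoint, headers, payload

endpoint, headers, payload = build_crawl_request(
    "YOUR_ILMENITE_API_KEY", "https://ilmenite.dev/docs"
)
print(json.dumps(payload))
```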

To explore more advanced implementations, visit the Ilmenite documentation.

Ready to build? Start for free and get 500 credits to test your agent.