Blog
Notes from the forge.
Engineering deep dives, benchmarks, comparisons, and how we think about building infrastructure for AI agents.
LangChain PDF Loader Alternatives — What to Use in 2026
LangChain ships several built-in PDF loaders: PyPDFLoader, UnstructuredPDFLoader, PyMuPDFLoader, PDFPlumberLoader, PDFMinerLoader, MathpixPDFLoader, AmazonTextractPDFLoader. They're not all good. Two ...
PDF to Markdown for LLMs — The Complete Guide
Every RAG pipeline hits this wall. You point an indexer at a folder of PDFs, you embed the chunks, you ship to production, and the LLM answers "the document doesn't mention that" when the document ver...
Ilmenite vs Marker vs Firecrawl — PDF to Markdown for LLMs
If you're building a RAG pipeline, a fine-tuning dataset, or an agent that reads PDFs, you've hit the wall everyone hits: PDFs are hostile to LLMs. Double-column layouts shuffle into nonsense. Tables ...
Web Scraping API for LLM Agents — The Complete Guide
An LLM agent that can't read the web is a toy. A browser plugin. A novelty that forgets the world changed after its training cutoff. The moment you want an agent to research a topic, look up a product...
Why PDFs Break Your RAG Pipeline (And How to Fix It)
Every RAG project that ingests real-world documents eventually runs into the same failure mode: the retriever pulls the "right" chunk, the LLM answers confidently, and the answer is wrong. You diagnos...
PDF→Markdown, billed per feature — what we shipped
> Draft notice: This blog post is a draft. Numbers marked
Ilmenite vs Firecrawl — 2026 Comparison
AI agents require clean, structured web data to function. When looking for a Firecrawl alternative, developers typically prioritize three things: speed of execution, cost of scaling, and the quality o...
Stop Fighting Cloudflare. Find the Hidden API Instead.
Every web scraping tutorial starts the same way: "launch a headless browser, render the page, extract the content." Then you run it against a real target and Cloudflare returns Error 1010. You swap in...
How to Run 1,000 Concurrent Scrapes for the Price of a Coffee
Running headless browsers at scale is usually a memory nightmare. If you use Chrome-based tools, each browser instance consumes between 200MB and 500MB of RAM. To handle 1,000 concurrent sessions, you...
Ilmenite vs Puppeteer — When You Don't Need a Full Browser
If you need to extract web data for an AI agent or a RAG pipeline, you have two primary paths: managing your own browser automation with a library like Puppeteer or using a managed Puppeteer alternati...
Headless Browser vs HTTP Scraping — When You Need Each
Choosing between a headless browser and simple HTTP requests is one of the first technical decisions a developer makes when building a data pipeline. The choice dictates your infrastructur...
The Real Cost of Running Chrome at Scale
Running a few headless browser instances for a small project is simple. Managing thousands of them in a production environment is a different problem entirely. For most developers, the primary concern...
Ilmenite vs Browserbase — Which One for AI Agents?
AI agents require reliable web data to function, but the infrastructure used to get that data varies wildly. If you are looking for a Browserbase alternative, the choice usually comes down to whether ...
Ilmenite vs Apify — Modern Web Scraping Compared
AI agents and RAG pipelines require clean, structured data to function. While many developers start with legacy platforms, finding a modern Apify alternative is often necessary when speed, cost, and L...
Ilmenite vs ScrapingBee — API Comparison
If you are looking for a ScrapingBee alternative to power an AI agent or a RAG pipeline, the choice comes down to whether you need proxy management or AI-ready data. ScrapingBee is a powerful proxy an...
Ilmenite vs Steel.dev — Open Source Browser APIs
If you are building AI agents or RAG pipelines, you need a reliable way to turn URLs into data. Both Steel.dev and Ilmenite provide browser APIs that abstract away the complexity of headless browser m...
Web Scraping API for AI Agents — What Works in 2026
AI agents need a way to perceive the live web. A web scraping API for AI agents is not a traditional scraper designed to dump thousands of rows into a CSV; it is a specialized data pipeline that conve...
Web Scraping with Python — Using the Ilmenite SDK
AI agents need clean data. If you are building a Python application, using a dedicated Python web scraping API is the most efficient way to turn raw URLs into LLM-ready markdown without managing headl...
Extract Structured Data from Any Website with a JSON Schema
If you are building an AI agent or a data pipeline, you don't need a wall of markdown; you need structured data. Using a data extraction API allows you to turn unstructured HTML into a clea...
How to Convert Any Website to Markdown with an API
AI agents and RAG pipelines require clean, structured data to function. Raw HTML is filled with noise—navigation bars, footer links, scripts, and CSS—that wastes LLM tokens and confuses the model. By ...
Scraping React and Next.js Sites — The Complete Guide
AI agents need data from modern web apps, but if you try to scrape a React website using standard HTTP libraries, you will likely receive a nearly empty HTML document. This happens because React and Nex...
Price Monitoring at Scale — Architecture Guide
Monitoring competitor prices at scale requires a reliable price monitoring API that can handle JavaScript rendering without the overhead of managing a browser cluster. Most e-commerce sites today use ...
Crawling Documentation Sites for Your Knowledge Base
Building a RAG (Retrieval-Augmented Generation) system requires high-quality, up-to-date data. When your source is a third-party documentation site, you need a way to crawl docs for RAG without inheri...
LangChain + Ilmenite — Web Scraping for AI Chains
Learn how to implement LangChain web scraping using Ilmenite's API. Build an AI agent that converts URLs to clean markdown for RAG pipelines and autonomous tools.
LlamaIndex + Ilmenite — Loading Live Web Data
AI agents and RAG (Retrieval-Augmented Generation) pipelines are only as good as the data they can access. This guide shows you how to implement LlamaIndex web scraping using Ilmenite to turn any URL ...
Build a RAG Pipeline with Web Data in 10 Minutes
Retrieval-Augmented Generation (RAG) allows AI agents to access real-time, external data without the need for constant model retraining. The biggest bottleneck in any RAG pipeline web scraping workflo...
Building an AI Research Assistant That Reads the Web
LLMs are limited by their training cutoff dates and a tendency to hallucinate when they lack specific, real-time information. To build a functional AI research assistant for the web, you must give the m...
Give Claude Access to the Web with MCP
Claude cannot browse the live web natively. By using the Model Context Protocol (MCP) and Ilmenite, you can set up MCP web browsing for Claude to give your AI assistant the ability to read pages, crawl...
MCP Isn't a Plugin. It's How AI Agents Talk to the Web Now.
AI agents are limited by their training data. To interact with the live web, they need a standardized way to fetch, parse, and process data in real time. This is where an MCP server web browsing imple...
Markdown vs HTML — Why LLMs Prefer Markdown
Large Language Models (LLMs) do not see websites the way humans do. While a browser renders HTML into a visual layout, an LLM processes text as a sequence of tokens. When you feed raw HTML into a prom...
Why We Built a Headless Browser in Rust
AI agents need to browse the web. To do this effectively, they require a way to render JavaScript, handle complex DOM structures, and extract clean data without the overhead of a full desktop browser....
Why We Wrote Our Headless Browser in Rust
AI agents require a constant stream of live web data to function. Whether they are performing research, updating a RAG pipeline, or executing autonomous tasks, the bottleneck is almost always the brow...