
Blog

Notes from the forge.

Engineering deep dives, benchmarks, comparisons, and how we think about building infrastructure for AI agents.

Tutorial · April 18, 2026 · 7 min

LangChain PDF Loader Alternatives — What to Use in 2026

LangChain ships several built-in PDF loaders: PyPDFLoader, UnstructuredPDFLoader, PyMuPDFLoader, PDFPlumberLoader, PDFMinerLoader, MathpixPDFLoader, AmazonTextractPDFLoader. They're not all good. Two ...

Engineering · 9 min

PDF to Markdown for LLMs — The Complete Guide

Every RAG pipeline hits this wall. You point an indexer at a folder of PDFs, you embed the chunks, you ship to production, and the LLM answers "the document doesn't mention that" when the document ver...

April 18, 2026

Comparison · 9 min

Ilmenite vs Marker vs Firecrawl — PDF to Markdown for LLMs

If you're building a RAG pipeline, a fine-tuning dataset, or an agent that reads PDFs, you've hit the wall everyone hits: PDFs are hostile to LLMs. Double-column layouts shuffle into nonsense. Tables ...

April 18, 2026

Engineering · 8 min

Web Scraping API for LLM Agents — The Complete Guide

An LLM agent that can't read the web is a toy. A browser plugin. A novelty that forgets the world changed after its training cutoff. The moment you want an agent to research a topic, look up a product...

April 18, 2026

Engineering · 8 min

Why PDFs Break Your RAG Pipeline (And How to Fix It)

Every RAG project that ingests real-world documents eventually runs into the same failure mode: the retriever pulls the "right" chunk, the LLM answers confidently, and the answer is wrong. You diagnos...

April 18, 2026

Blog · 3 min

PDF→Markdown, billed per feature — what we shipped

> Draft notice: This blog post is a draft. Numbers marked

April 15, 2026

Comparison · 6 min

Ilmenite vs Firecrawl — 2026 Comparison

AI agents require clean, structured web data to function. When looking for a Firecrawl alternative, developers typically prioritize three things: speed of execution, cost of scaling, and the quality o...

April 14, 2026

Tutorial · 5 min

Stop Fighting Cloudflare. Find the Hidden API Instead.

Every web scraping tutorial starts the same way: "launch a headless browser, render the page, extract the content." Then you run it against a real target and Cloudflare returns Error 1010. You swap in...

April 13, 2026

Tutorial · 4 min

How to Run 1,000 Concurrent Scrapes for the Price of a Coffee

Running headless browsers at scale is usually a memory nightmare. If you use Chrome-based tools, each browser instance consumes between 200MB and 500MB of RAM. To handle 1,000 concurrent sessions, you...

April 12, 2026

Comparison · 6 min

Ilmenite vs Puppeteer — When You Don't Need a Full Browser

If you need to extract web data for an AI agent or a RAG pipeline, you have two primary paths: managing your own browser automation with a library like Puppeteer or using a managed Puppeteer alternati...

April 11, 2026

Engineering · 7 min

Headless Browser vs HTTP Scraping — When You Need Each

Choosing between a headless browser and simple HTTP requests is one of the first technical decisions a developer makes when building a data pipeline. The choice dictates your infrastructur...

April 10, 2026

Engineering · 6 min

The Real Cost of Running Chrome at Scale

Running a few headless browser instances for a small project is simple. Managing thousands of them in a production environment is a different problem entirely. For most developers, the primary concern...

April 9, 2026

Comparison · 5 min

Ilmenite vs Browserbase — Which One for AI Agents?

AI agents require reliable web data to function, but the infrastructure used to get that data varies wildly. If you are looking for a Browserbase alternative, the choice usually comes down to whether ...

April 8, 2026

Comparison · 6 min

Ilmenite vs Apify — Modern Web Scraping Compared

AI agents and RAG pipelines require clean, structured data to function. While many developers start with legacy platforms, finding a modern Apify alternative is often necessary when speed, cost, and L...

April 7, 2026

Comparison · 6 min

Ilmenite vs ScrapingBee — API Comparison

If you are looking for a ScrapingBee alternative to power an AI agent or a RAG pipeline, the choice comes down to whether you need proxy management or AI-ready data. ScrapingBee is a powerful proxy an...

April 6, 2026

Comparison · 5 min

Ilmenite vs Steel.dev — Open Source Browser APIs

If you are building AI agents or RAG pipelines, you need a reliable way to turn URLs into data. Both Steel.dev and Ilmenite provide browser APIs that abstract away the complexity of headless browser m...

April 5, 2026

Engineering · 7 min

Web Scraping API for AI Agents — What Works in 2026

AI agents need a way to perceive the live web. A web scraping API for AI agents is not a traditional scraper designed to dump thousands of rows into a CSV; it is a specialized data pipeline that conve...

April 4, 2026

Tutorial · 5 min

Web Scraping with Python — Using the Ilmenite SDK

AI agents need clean data. If you are building a Python application, using a dedicated Python web scraping API is the most efficient way to turn raw URLs into LLM-ready markdown without managing headl...

April 3, 2026

Tutorial · 4 min

Extract Structured Data from Any Website with a JSON Schema

If you are building an AI agent or a data pipeline, you don't need a wall of markdown; you need structured data. Using a data extraction API allows you to turn unstructured HTML into a clea...

April 2, 2026

Tutorial · 5 min

How to Convert Any Website to Markdown with an API

AI agents and RAG pipelines require clean, structured data to function. Raw HTML is filled with noise—navigation bars, footer links, scripts, and CSS—that wastes LLM tokens and confuses the model. By ...

April 1, 2026

Tutorial · 5 min

Scraping React and Next.js Sites — The Complete Guide

AI agents need data from modern web apps, but if you try to scrape a React website using standard HTTP libraries, you will likely receive a nearly empty HTML document. This happens because React and Nex...

March 31, 2026

Use Case · 5 min

Price Monitoring at Scale — Architecture Guide

Monitoring competitor prices at scale requires a reliable price monitoring API that can handle JavaScript rendering without the overhead of managing a browser cluster. Most e-commerce sites today use ...

March 30, 2026

Use Case · 5 min

Crawling Documentation Sites for Your Knowledge Base

Building a RAG (Retrieval-Augmented Generation) system requires high-quality, up-to-date data. When your source is a third-party documentation site, you need a way to crawl docs for RAG without inheri...

March 29, 2026

Tutorial · 5 min

LangChain + Ilmenite — Web Scraping for AI Chains

Learn how to implement LangChain web scraping using Ilmenite's API. Build an AI agent that converts URLs to clean markdown for RAG pipelines and autonomous tools.

March 28, 2026

Tutorial · 5 min

LlamaIndex + Ilmenite — Loading Live Web Data

AI agents and RAG (Retrieval-Augmented Generation) pipelines are only as good as the data they can access. This guide shows you how to implement LlamaIndex web scraping using Ilmenite to turn any URL ...

March 27, 2026

Tutorial · 4 min

Build a RAG Pipeline with Web Data in 10 Minutes

Retrieval-Augmented Generation (RAG) allows AI agents to access real-time, external data without the need for constant model retraining. The biggest bottleneck in any RAG pipeline web scraping workflo...

March 26, 2026

Use Case · 5 min

Building an AI Research Assistant That Reads the Web

LLMs are limited by their training cut-off dates and a tendency to hallucinate when they lack specific, real-time information. To build a functional AI research assistant for the web, you must give the m...

March 25, 2026

Tutorial · 5 min

Give Claude Access to the Web with MCP

Claude cannot browse the live web natively. By using the Model Context Protocol (MCP) and Ilmenite, you can implement web browsing for Claude over MCP to give your AI assistant the ability to read pages, crawl...

March 24, 2026

Engineering · 7 min

MCP Isn't a Plugin. It's How AI Agents Talk to the Web Now.

AI agents are limited by their training data. To interact with the live web, they need a standardized way to fetch, parse, and process data in real time. This is where an MCP server web browsing imple...

March 23, 2026

Engineering · 7 min

Markdown vs HTML — Why LLMs Prefer Markdown

Large Language Models (LLMs) do not see websites the way humans do. While a browser renders HTML into a visual layout, an LLM processes text as a sequence of tokens. When you feed raw HTML into a prom...

March 22, 2026

Engineering · 6 min

Why We Built a Headless Browser in Rust

AI agents need to browse the web. To do this effectively, they require a way to render JavaScript, handle complex DOM structures, and extract clean data without the overhead of a full desktop browser....

March 21, 2026

Engineering · 7 min

Why We Wrote Our Headless Browser in Rust

AI agents require a constant stream of live web data to function. Whether they are performing research, updating a RAG pipeline, or executing autonomous tasks, the bottleneck is almost always the brow...

March 20, 2026