Engineering · April 18, 2026 · 9 min · Ilmenite Team

PDF to Markdown for LLMs — The Complete Guide

Every RAG pipeline hits this wall. You point an indexer at a folder of PDFs, you embed the chunks, you ship to production, and the LLM answers "the document doesn't mention that" when the document very clearly does. The culprit is almost always the PDF → text step, and the fix is to stop extracting text and start extracting Markdown.

This is the long-form pillar: what's actually broken about PDFs, how every class of tool tries to fix it, what tradeoffs they make, and how to evaluate a converter for your corpus without taking anyone's word for it — including ours.

If you want the shortcut, we built the benchmark comparison with real numbers on a 7-PDF corpus. This post is the map. That post is the territory.

Why PDFs are hostile to LLMs

PDFs were designed in 1993 to preserve what the page looks like when printed. Nothing about that goal cares whether a program can parse the content back out. A PDF is a list of positioned glyphs, images, and vector paths. There is no "paragraph," no "heading," no "table." Those are inferences a reader's brain performs; they are not encoded in the file.

Concretely, here's what a naive "extract text from PDF" pass does to your RAG pipeline:

Reading order collapses. On a two-column paper, the raw text stream alternates columns line-by-line. Your chunker splits on that stream, your embedder embeds "half of paragraph A, half of paragraph B, half of paragraph A, half of paragraph B," and retrieval pulls garbage. The LLM sees a sentence that starts in the method section and ends in the conclusion.

Tables collapse into prose. A 5×4 financial table becomes 20 unlabeled numbers in a row. Row and column relationships are gone. The LLM now has to guess which number is revenue vs. expenses, and it guesses wrong, confidently.

Figures vanish. Most extractors drop images or replace them with filename placeholders. Anything useful about a chart — the caption, the numeric values labeled on the axes, the trend — is lost.

Footnotes land mid-paragraph. A PDF footnote is usually a small text block at the bottom of the page with its own coordinate. A naive extractor reads it in document order, which means footnote text inserts itself between the paragraph that referenced it and the next one. Retrieval fragments.

Scanned documents produce nothing at all. If the PDF is a scanned image (common in legal, medical, government), there is no text layer. You get an empty string. The RAG pipeline silently indexes no content and nobody notices until a user complains.

Each one of these is a category of bug your LLM will not tell you about. It will answer confidently with hallucinated content because that's what retrieval-augmented models do when retrieval returns confused slop.
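The reading-order failure is easy to see in miniature. A toy simulation (plain Python, no real extractor involved) of how a naive pass linearizes a two-column page:

```python
# Illustrative only: simulate a naive extractor linearizing a
# two-column page by reading lines top-to-bottom across both columns.
left_column = ["Methods: we sampled 40 sites", "and measured pH at each."]
right_column = ["Results: mean pH was 6.1,", "well below the control group."]

# Naive extraction interleaves the columns line by line...
naive = [line for pair in zip(left_column, right_column) for line in pair]

# ...so any chunker splitting this stream produces sentences that
# jump between Methods and Results mid-thought.
print(" ".join(naive))

# A layout-aware extractor emits each column whole instead:
correct = left_column + right_column
```

Embed `naive` and retrieval returns fragments that start in one section and end in another, which is exactly the failure users report as "the model hallucinates."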

The three classes of solution

Every PDF→Markdown tool falls into one of three architectures. The tradeoffs are genuinely different and you should pick based on your corpus, not on which one has the slickest marketing.

Class 1: geometric extractors

Examples: pdfium, pdf-extract, pdftotext (poppler), Ilmenite's PDF engine.

How it works: parse the PDF's internal object model, read text runs with their positions, run heuristics on positions to reconstruct reading order and detect structure (headings via font size, tables via grid line geometry, paragraphs via vertical whitespace).

Strengths:

  • Fast. Sub-100ms per page for born-digital PDFs is normal.
  • Cheap. No ML inference, so no GPU or compute-heavy dependencies.
  • Deterministic. Same input produces same output forever. Easy to test.
  • Transparent. You can look at the code and understand why the output is what it is.

Weaknesses:

  • Struggles with adversarial layouts (irregular multi-column, magazine-style).
  • Can't read scanned pages on its own; a separate OCR stage is required.
  • Whitespace-aligned tables (common in academic papers) are invisible to ruled-line detectors.
  • Math notation is hard without dedicated detection.

When to use: born-digital PDFs at scale. RAG pipelines. Agent workflows where you need predictable latency. Anywhere throughput and cost matter.
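To make "headings via font size" concrete, here is a minimal sketch of that heuristic. `TextRun`, the thresholds, and `runs_to_markdown` are illustrative inventions for this post, not any specific library's API; real engines also weigh boldness, position, and font family.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    text: str
    font_size: float  # points, as reported by the PDF's object model

def runs_to_markdown(runs: list[TextRun], body_size: float) -> list[str]:
    """Classify runs as headings by how much larger they are than body text."""
    out = []
    for run in runs:
        ratio = run.font_size / body_size
        if ratio >= 1.6:        # much larger than body -> top-level heading
            out.append(f"# {run.text}")
        elif ratio >= 1.3:
            out.append(f"## {run.text}")
        elif ratio >= 1.15:
            out.append(f"### {run.text}")
        else:
            out.append(run.text)
    return out

runs = [TextRun("PDF to Markdown", 20.0), TextRun("Why it matters", 14.0),
        TextRun("Body paragraph text.", 10.0)]
print("\n".join(runs_to_markdown(runs, body_size=10.0)))
```

This is why geometric extraction is deterministic and testable: the same runs always map to the same headings.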

Class 2: ML-based extractors

Examples: Marker, Nougat, Mathpix, Docling (partially ML), Unstructured.io's "hi_res" mode.

How it works: feed rendered page images into a layout-detection model, a text-recognition (OCR) model, and often a table-structure model. Compose the outputs into Markdown. Sometimes add a final LLM pass for tricky regions.

Strengths:

  • Handles hard layouts (irregular, multi-column, decorative) because the model learned the patterns.
  • Reads scanned PDFs natively — OCR is built in.
  • Catches whitespace-aligned tables that geometric detectors miss.
  • Can recover math notation into LaTeX.

Weaknesses:

  • Slow. Seconds per page on GPU; tens of seconds to minutes per page on CPU.
  • Expensive. GPU time is not free, model weights are multi-GB.
  • Non-deterministic. The same page can produce slightly different output across runs, model versions, or hardware.
  • Operationally heavy. Python + PyTorch + CUDA + model weight management.

When to use: scientific / scanned / legal corpora where quality dominates cost. Small volumes of very high-value documents. Anywhere a human would say "I'd rather it be slow and correct than fast and wrong."

Class 3: browser-based extractors

Examples: Firecrawl's PDF path, Chromium-rendered scrapers, headless-browser approaches.

How it works: render the PDF in a browser as if it were a web page, scrape the rendered DOM/text, post-process into Markdown. Sometimes OCR the rendered image for fallback.

Strengths:

  • One surface for both web pages and PDFs. If you're already paying for a browser-based scraper, PDFs "just work."
  • Doesn't require PDF-specific infrastructure.

Weaknesses:

  • Slowest of the three. Browser startup + render + extract is seconds per document minimum.
  • Loses structure. The browser flattens the PDF into DOM text; table structure usually collapses, images usually drop.
  • Can't do per-feature toggles. You pay for the full browser even on a simple text-only PDF.
  • Most expensive per document for PDF-heavy workloads.

When to use: you already use the same API for web scraping, and PDFs are an occasional secondary concern.

Pick the right tool for your corpus

A useful decision tree:

Is your corpus mostly scanned? 
├─ YES → ML-based (Marker on GPU, or Datalab hosted)
└─ NO ↓

Is your corpus mostly scientific with dense math / irregular tables?
├─ YES, volume < 1000/day → ML-based
├─ YES, volume > 1000/day → geometric for throughput + ML for re-processing the ~5% that fail quality checks
└─ NO ↓

Is your corpus mostly born-digital business documents (reports, blogs, RFCs, manuals)?
├─ YES → geometric (Ilmenite, pdfium-based tools)
└─ MIXED → geometric first, escalate failures to ML
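The tree above can be sketched as a routing function. Names and the 1000/day threshold mirror the tree; tune both for your own workload:

```python
def pick_extractor(mostly_scanned: bool, scientific: bool,
                   docs_per_day: int, born_digital_business: bool) -> str:
    """Route a corpus to an extractor class, following the decision tree above."""
    if mostly_scanned:
        return "ml"                       # OCR is built in
    if scientific:
        if docs_per_day < 1000:
            return "ml"                   # quality dominates cost
        return "geometric + ml fallback"  # throughput first, escalate failures
    if born_digital_business:
        return "geometric"
    return "geometric first, escalate failures to ml"
```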

The anti-pattern is using a browser-based extractor for PDF workloads at scale. That's almost always a cost and performance disaster. Use PDF-native tools for PDFs.

What to actually evaluate

"PDFs converted to Markdown" is not an evaluation. Here's what is:

1. Retrieval precision on YOUR documents

Take 50 queries your users actually ask. For each, note which PDF and which section has the answer. Convert the PDFs with each candidate tool. Run your retriever. Measure: for each query, does the correct section land in the top-K retrieved chunks?

This is the only number that matters. Everything else is a proxy.
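The metric itself is a few lines. A minimal sketch, assuming you have labeled each query with the section that answers it and your retriever returns ranked section ids:

```python
def topk_hit_rate(gold: dict[str, str],
                  retrieved: dict[str, list[str]], k: int = 5) -> float:
    """gold maps query -> section id containing the answer;
    retrieved maps query -> ranked list of retrieved chunk section ids.
    Returns the fraction of queries whose gold section lands in the top-k."""
    hits = sum(1 for q, section in gold.items()
               if section in retrieved.get(q, [])[:k])
    return hits / len(gold)

gold = {"q1": "sec-3", "q2": "sec-7"}
retrieved = {"q1": ["sec-1", "sec-3"], "q2": ["sec-2", "sec-4"]}
print(topk_hit_rate(gold, retrieved, k=2))  # 0.5
```

Run it once per candidate tool over the same 50 queries and compare the numbers directly.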

2. Structural coverage

On a sample of 10 PDFs from your corpus, manually identify:

  • Every heading (H1, H2, H3) — did the tool emit them as Markdown headings?
  • Every table — did the tool emit a Markdown table, a CSV blob, or nothing?
  • Every image — did the tool extract it, reference it by URL, or drop it?
  • Every footnote — is it separated from the body, or mixed in?
  • Every math expression — is it LaTeX, is it raw symbols, or is it lost?

Score each category 0-5. Add up. Compare tools.
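The bookkeeping is trivial but worth standardizing so scores stay comparable across tools. A sketch:

```python
CATEGORIES = ("headings", "tables", "images", "footnotes", "math")

def coverage_score(scores: dict[str, int]) -> int:
    """Sum per-category 0-5 scores into a single 0-25 structural coverage total."""
    for cat in CATEGORIES:
        assert 0 <= scores[cat] <= 5, f"{cat} must be scored 0-5"
    return sum(scores[cat] for cat in CATEGORIES)

tool_a = {"headings": 5, "tables": 4, "images": 3, "footnotes": 2, "math": 1}
print(coverage_score(tool_a))  # 15
```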

3. Latency and cost per page

At your working volume, what does processing 1,000 pages actually cost in (a) dollars, (b) wall-clock time, (c) engineering babysitting? A "free" tool that requires a GPU and a Python environment isn't free when you factor in operations.

4. Failure mode

When a page fails (and one will), does the tool:

  • Tell you it failed, with a clear reason?
  • Silently emit empty or garbage Markdown?
  • Crash the whole pipeline?

Silent failure modes are the worst. You won't notice until your users do.
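You can defend against the silent mode in your own pipeline regardless of which tool you pick. A minimal guard (the exception class, function name, and 50-character threshold are illustrative defaults, not part of any tool's API):

```python
class ExtractionError(Exception):
    """Raised instead of letting empty/garbage output flow into the index."""

def validate_markdown(markdown: str, min_chars: int = 50) -> str:
    """Refuse to index output that is empty or suspiciously short,
    e.g. a scanned page that was converted with no OCR applied."""
    n = len(markdown.strip())
    if n < min_chars:
        raise ExtractionError(
            f"extractor returned {n} chars; "
            "page may be scanned or the conversion failed")
    return markdown
```

Wiring this between conversion and embedding turns the worst failure mode (silent) into the best one (loud, with a reason).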

How Ilmenite's PDF engine is designed

Brief tour so you know what you're evaluating if you test ours. Full architecture is in the PDF engine launch post; the short version:

Tiered pricing model. Each PDF request specifies a Tier (Light, Standard, Scientific, Scanned, Max) or explicit Features (tables, formulas, images, preserve_layout, ocr). Cost is computed per-feature per-page — you only pay for features that actually fired on actual pages.

Classifier-driven routing. The engine inspects each page and assigns it a class (BornDigitalSimple, BornDigitalComplex, Scanned, Hybrid). In Auto tier, the classifier picks the cheapest tier that produces faithful output for that page. You don't pay OCR costs on a page with no scanned content.

Conservative table detection. The detector requires a real grid: minimum 3×3, stroke length >20pt, topology-consistent horizontals and verticals, cell-fill-ratio check. The philosophy: a false table in an LLM context is worse than a missing table, because the model trusts the | characters and hallucinates structure from them. We pick precision over recall.
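The precision-over-recall idea can be sketched as a grid test. This is an illustrative toy with made-up thresholds, not Ilmenite's actual implementation; it takes stroke lengths of detected ruled lines and a count of cells containing text:

```python
def is_real_table(h_lines: list[float], v_lines: list[float],
                  filled_cells: int, min_stroke_pt: float = 20.0) -> bool:
    """Conservative grid check: only accept a table when real ruled lines
    bound a grid of at least 3x3 cells and most cells contain text."""
    # n parallel rules bound n-1 rows/columns; ignore short decorative strokes
    rows = sum(1 for s in h_lines if s > min_stroke_pt) - 1
    cols = sum(1 for s in v_lines if s > min_stroke_pt) - 1
    if rows < 3 or cols < 3:
        return False
    fill_ratio = filled_cells / (rows * cols)
    return fill_ratio >= 0.5  # reject sparse grids (likely layout artifacts)
```

Anything that fails the check stays prose, which is the safer error for an LLM consumer.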

Image extraction. On Scientific tier and above, the engine extracts raster images via pdfium, re-encodes them as PNG, uploads them to R2, and swaps the CDN URL into the final Markdown. Images land in the standard ![caption](https://cdn.../xyz.png) form, so downstream tools (LangChain, LlamaIndex) render them correctly.

Pure-Rust OCR. The Scanned tier uses ocrs (pure Rust, no Python or tesseract subprocess) with classifier gating — OCR only runs on pages flagged as scanned, not on every page. Latin-first model default; multilingual model loads on demand for the Scientific tier.

Pricing transparency. Every response includes a line_items array showing which features fired on how many pages at what rate. The estimate endpoint (/v1/pdf/estimate) returns a max-possible cost before you commit, so you can budget per document.

Integrating with LangChain, LlamaIndex, and raw pipelines

The output is Markdown, which means it drops into any embedding-based RAG pipeline without adapters. A sketch:

import os

import httpx
from langchain.text_splitter import MarkdownHeaderTextSplitter

API_KEY = os.environ["ILMENITE_API_KEY"]  # example env var name
pdf_url = "https://example.com/report.pdf"

response = httpx.post(
    "https://api.ilmenite.dev/v1/pdf/extract",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": pdf_url, "tier": "standard"},
    timeout=60.0,
)
response.raise_for_status()  # fail loudly rather than index garbage
markdown = response.json()["markdown"]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(markdown)

# chunks is now a list of Document objects with heading metadata preserved.
# Ready for your embedder + vector store.

LlamaIndex is the same shape. Because the output is standard Markdown with proper heading hierarchy, table syntax, and image links, the downstream splitter and retriever just work — you don't need a PDF-specific loader class. That's the entire point of producing Markdown instead of raw text: downstream tooling already knows how to handle it.

The corpus-specific truth

No generic benchmark describes your corpus. We published ours on a 7-PDF public corpus to show our work, but the real test is running the tools you're evaluating against your documents with your queries and measuring your downstream retrieval quality.

If you want help with that side — scoring retrieval on your corpus, tuning the classifier for your document mix, comparing our numbers against Marker on your hardware — reach out. For most customers this turns into a 30-minute session where we import their top 20 PDFs, run the pipeline, and inspect the output together.


Summary in one paragraph

If you're building a RAG pipeline or an agent that reads PDFs, a geometric extractor (Ilmenite, pdfium-based tools) is the right default for born-digital documents — fast, cheap, predictable, and good enough on the 80-90% of real-world PDFs that RAG pipelines actually ingest. Use an ML extractor (Marker on GPU) for the scientific or scanned portion of your corpus where quality dominates cost. Avoid browser-based PDF extractors at scale; they're the most expensive and the most structure-lossy. And whatever you pick, evaluate on your documents with your queries — not on a vendor's cherry-picked demo PDF.