# Ilmenite vs Marker vs Firecrawl — PDF to Markdown for LLMs
If you're building a RAG pipeline, a fine-tuning dataset, or an agent that reads PDFs, you've hit the wall everyone hits: PDFs are hostile to LLMs. Double-column layouts shuffle into nonsense. Tables collapse into prose. Images vanish. Footnotes land mid-paragraph. The model gets garbage and you get "the document doesn't say" when the document clearly does.
So you look for a PDF→Markdown converter, and three names keep showing up: Ilmenite (disclosure: this is us), Marker (open-source, ML-first), and Firecrawl (scrape-first, PDF as a side capability). They look similar. They're not. This post benchmarks them on the same corpus and tells you which one to pick when.
We're going to be honest about what we did and didn't measure. We ran Ilmenite end-to-end against a 7-PDF corpus on a MacBook. We attempted to run Marker on the same corpus on CPU — it spent 8+ minutes on model load alone before we killed it, which is a real finding, so we're reporting it. For Marker's actual inference numbers we reference their published GPU benchmarks. For Firecrawl we use their documented pricing and behavior since we don't have an API key active on this account.
The benchmark harness and corpus are checked into the repo. You can reproduce every number here.
## The three tools, in one paragraph each
Ilmenite is a pure-Rust hosted API for web scraping and PDF→Markdown. The PDF engine uses pdfium for text extraction, a conservative geometric table detector, pure-Rust OCR (ocrs/rten) for scanned pages, and an async post-pass that uploads extracted images to R2 and swaps CDN URLs into the final Markdown. Pricing is per-feature per-page — you pay base text on every page, plus a small surcharge on pages where tables, images, or OCR actually fired. No ML in the hot path unless you opt into the Scientific or Max tier.
Marker is an open-source Python library from Datalab that runs a stack of ML models end-to-end: layout detection (Surya), OCR (Surya), table recognition (Surya Table Rec), and an LLM post-pass for tricky regions. Excellent output on complex, scientific PDFs. You run it yourself — GPU for speed, CPU if you don't mind seconds-to-minutes per page. There's a hosted version (Datalab) if you don't want to manage models.
Firecrawl is primarily a web scraping API. PDF handling is secondary: give it a PDF URL, it returns Markdown via a browser-backed pipeline. Works fine for simple documents. It does not expose per-feature toggles, per-page pricing, or the kind of table and image control you get from a PDF-native engine. Pricing is credit-based.
## Design philosophy — this is where they actually differ
| Dimension | Ilmenite | Marker | Firecrawl |
|---|---|---|---|
| Runtime | Rust + pdfium | Python + PyTorch (GPU ideal) | Chromium-backed |
| Approach | Geometric heuristics + optional OCR | End-to-end ML | Browser render + extract |
| Latency target | tens of ms/page for text | seconds/page GPU, 30-90s/page CPU | seconds/page |
| Footprint | ~150 MB binary | ~5 GB with Surya + table-rec models | hosted only |
| Pricing | per feature per page (micro-dollars) | self-hosted (free + GPU cost) / Datalab hosted | per-credit |
| PDF-specific | first-class product surface | entire product | secondary feature |
| Tables | conservative geometric detector | ML table rec (strong on irregular tables) | browser-flattened |
| Images | extracted → CDN → URL in markdown | extracted → disk | typically dropped |
| OCR | pure-Rust, classifier-gated | Surya OCR (heavy, accurate) | page-level via browser |
| Math / formulas | $...$ / $$...$$ (Scientific tier) | LaTeX via Surya + LLM post | typically lost |
Short version: Ilmenite is the pragmatic engineer's choice. Marker is the quality-at-any-cost choice. Firecrawl is "I'm already using it for scraping and PDFs are occasional".
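Ilmenite's per-feature per-page billing can be sketched as a small cost function. The rate constants below are illustrative placeholders, not the published rate card — the point is the shape of the model: every page pays base text, and surcharges apply only on pages where a feature actually fired.

```python
# Hypothetical rates in micro-dollars (1e-6 USD) per page.
# These are placeholders for illustration, NOT Ilmenite's rate card.
BASE_TEXT = 200
TABLE_SURCHARGE = 200   # only on pages where table detection fired
IMAGE_SURCHARGE = 100   # only on pages with extracted images
OCR_SURCHARGE = 500     # only on pages the scanned-page classifier flags

def page_cost(has_table=False, has_images=False, needs_ocr=False):
    """Micro-dollar cost for one page under per-feature billing."""
    cost = BASE_TEXT
    if has_table:
        cost += TABLE_SURCHARGE
    if has_images:
        cost += IMAGE_SURCHARGE
    if needs_ocr:
        cost += OCR_SURCHARGE
    return cost

print(page_cost())                 # born-digital text page: 200
print(page_cost(has_table=True))   # page with a detected table: 400
```

The practical consequence: a 100-page born-digital report pays OCR rates on zero pages, while a flat per-page or per-credit model charges as if every page might need the expensive path.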
## Benchmark setup
Corpus: 7 PDFs, all public / CC-licensed, checked in at `tests/fixtures/pdfs/`:

- `hello.pdf` — 1-page trivial control
- `arxiv-2604-12992.pdf` — 16-page multi-column arXiv paper
- `arxiv-bert.pdf` — the BERT paper (1810.04805), 16 pages
- `arxiv-transformer.pdf` — Attention Is All You Need (1706.03762), 15 pages
- `arxiv-mistral-7b.pdf` — Mistral 7B (2310.06825), 9 pages
- `arxiv-gpt3.pdf` — GPT-3 (2005.14165), 75 pages
- `irs-1040.pdf` — US IRS Form 1040, 2 pages, heavy form grid
Hardware: M-series MacBook, CPU only. That is deliberate. Most production RAG pipelines run on commodity CPUs. Numbers that assume an H100 don't describe real deployments.
Tier settings:
- Ilmenite: `Tier::Standard` (text + headings + tables + layout preservation)
- Marker: default `PdfConverter`. Attempted — killed after 8+ minutes on model load on CPU. We use their published GPU benchmarks below.
- Firecrawl: documented behavior. No live requests in this run.
## Ilmenite results — real numbers
Run with `cargo bench --bench pdf_benchmark -- tests/fixtures/pdfs standard`:
| file | pages | wall | ms/page | md size | headings | tables |
|---|---|---|---|---|---|---|
| hello.pdf | 1 | 2ms | 2 | 9 B | 0 | 0 |
| arxiv-mistral-7b.pdf | 9 | 270ms | 30 | 24.3 KB | 12 | 0 |
| arxiv-gpt3.pdf | 75 | 2.6s | 34 | 234.4 KB | 97 | 0 |
| arxiv-transformer.pdf | 15 | 646ms | 43 | 37.7 KB | 13 | 0 |
| arxiv-2604-12992.pdf | 16 | 1.3s | 82 | 48.6 KB | 12 | 0 |
| arxiv-bert.pdf | 16 | 1.3s | 82 | 61.9 KB | 4 | 0 |
| irs-1040.pdf | 2 | 1.6s | 793 | 11.5 KB | 0 | 2 |
Totals: 7/7 succeeded, 134 pages in 7.7s, average 58 ms/page. Total output 418 KB of Markdown.
Total cost at Ilmenite's published rate card: 27,200 micro-dollars = $0.0272 for all 7 PDFs, 134 pages. The 75-page GPT-3 paper costs $0.015.
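The micro-dollar accounting is easy to sanity-check from the numbers above:

```python
# Sanity-check the corpus total: 27,200 micro-dollars over 134 pages.
total_microdollars = 27_200
total_usd = total_microdollars / 1_000_000   # 1 micro-dollar = 1e-6 USD
per_page = total_microdollars / 134          # effective blended rate

print(total_usd)          # 0.0272
print(round(per_page))    # 203 micro-dollars per page
```

That ~203 µ$/page blended rate includes the two table-surcharged IRS pages; the pure-text GPT-3 paper comes out at $0.015 / 75 = 200 µ$/page.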
## What the numbers tell you
The 75-page GPT-3 paper converts in 2.6 seconds. For a RAG pipeline that's "paste the PDF URL, index the result before the user notices." A Marker run on CPU of the same PDF would take roughly 75 × 30-60s = 37-75 minutes (per Marker's own CPU guidance).
Per-page latency drops on larger PDFs (34 ms/page on GPT-3 vs. 82 ms/page on smaller papers). That's pdfium startup overhead amortizing across more pages — a real property of the engine. For bulk ingestion, you win on long documents.
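The amortization is a standard fixed-plus-marginal cost curve. The constants below are a two-point fit to the bert and gpt3 rows only, chosen to illustrate the shape — other rows (e.g. mistral-7b at 30 ms/page) deviate because per-page content complexity varies too, so don't read these as engine internals.

```python
# Toy fixed-overhead model: per-page latency = startup/pages + marginal.
# startup_ms and marginal_ms are fit to two rows of the benchmark table
# above (illustrative only, not measured engine constants).
def ms_per_page(pages, startup_ms=976.0, marginal_ms=21.0):
    return startup_ms / pages + marginal_ms

print(round(ms_per_page(16)))  # ~82 ms/page, matches the 16-page papers
print(round(ms_per_page(75)))  # ~34 ms/page, matches the GPT-3 paper
```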
The IRS form spikes to 793 ms/page. That's table detection doing real geometric work on a dense form with a full-page grid. The engine correctly emitted 2 `Block::Table` entries (one per page). Look at `results/ilmenite/irs-1040.pdf.md` to see the raw output — the table structure is preserved, though the IRS form is genuinely hard because every field is in a cell.
Zero tables detected on the arXiv papers. This is a deliberate design choice, and it's a fair limitation to call out. Ilmenite's table detector is geometry-based: it looks for ruled grid lines. arXiv papers typically use whitespace-aligned "tables" without explicit borders — the detector sees them as paragraphs. If your corpus is heavy on whitespace-table scientific papers, this is an area where Marker's ML table recognition will outperform us today. The roadmap is to add a whitespace-table detector for the Scientific tier.
## Marker — what their published numbers say
From Marker's GitHub README and paper, on GPU (A100 typical): roughly 1 page/sec for the default converter on ordinary documents, slower on heavy pages where the LLM post-pass fires. On CPU they explicitly recommend not using it for anything but small tests — which matches what we saw.
That means on commodity hardware without a GPU, Marker is not a realistic option for production PDF ingestion. On GPU, 1 page/sec is fine for offline batch processing but ~60× slower than Ilmenite's text path for the same born-digital documents.
Where Marker wins: irregular tables without ruled lines, math-heavy scientific PDFs, scanned documents with non-Latin text, and anything where the layout is adversarial. Their Surya Table Rec model catches tables Ilmenite's geometric detector misses. That's real. If your corpus is 100 IEEE papers with nested equations, Marker on a GPU is the correct choice today.
Where Marker loses: operational overhead (Python + CUDA + multi-GB model weights + GPU budget), cold-start latency, throughput on CPU, cost per page on mixed or text-heavy corpora.
## Firecrawl — from their docs
Firecrawl's `/v1/scrape` accepts PDF URLs. The pipeline renders via a browser and extracts. Expected behavior, per their documentation:
- Works: plain-text PDFs, simple layouts
- Limited: table structure (typically flattened to text), image extraction (usually dropped), math (lost)
- No per-feature toggles (can't ask for "text + tables, no images")
- Per-credit pricing at roughly $0.001-$0.003 per successful PDF scrape depending on plan
If you're already paying Firecrawl for web scraping and PDFs are an occasional thing, using their endpoint is fine. If you have real PDF volume or care about structure, use a PDF-native engine.
## Cost — worked example
"RAG pipeline ingests 1,000 arXiv-style papers, ~15 pages each" = 15,000 pages.
| Tool | Per-page cost | Total | Notes |
|---|---|---|---|
| Ilmenite | ~$0.0003 | ~$4.50 | Standard tier, tables+text |
| Marker self-hosted | GPU time | ~$3-5 | A10 at $0.75/hr × ~5hr (GPU warm) |
| Marker hosted (Datalab) | their per-page | ~$15 | at their listed $0.001/page |
| Firecrawl | ~$0.002-0.003 | ~$30-45 | credit pricing, per scrape |
Numbers to within a factor of 2; run your own for your corpus. Ilmenite and Marker self-hosted are in the same cost ballpark for born-digital text corpora, and both are an order of magnitude cheaper than the browser-based alternatives.
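The table's arithmetic, spelled out so you can swap in your own rates (the per-page figures are the same rough approximations as above, not quotes):

```python
# Worked example: 1,000 papers x ~15 pages = 15,000 pages.
# Rates are the rough per-page approximations from the table above.
pages = 1_000 * 15

ilmenite = pages * 0.0003       # Standard tier
datalab = pages * 0.001         # Marker hosted, listed per-page rate
firecrawl_low = pages * 0.002   # credit pricing, low end
firecrawl_high = pages * 0.003  # credit pricing, high end

print(round(ilmenite, 2))        # 4.5
print(round(datalab, 2))         # 15.0
print(round(firecrawl_low, 2), round(firecrawl_high, 2))  # 30.0 45.0
```

Marker self-hosted isn't a per-page rate at all — at ~1 page/sec on GPU, 15,000 pages is ~4.2 hours of A10 time, which is where the ~$3-5 figure comes from.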
## When to pick which
Pick Ilmenite if you:
- are building a RAG or agent pipeline where predictable sub-second latency matters
- have mostly born-digital PDFs (80-90% of RAG input in the wild)
- want per-feature billing so you only pay for OCR on actually-scanned pages
- don't want to run Python + CUDA + 5 GB of model weights
- want one API surface for both web pages and PDFs
Pick Marker if you:
- have GPUs available and need highest fidelity on hard PDFs (scientific/scanned/irregular tables)
- are OK with seconds per page latency
- want to run on your own hardware (compliance, data residency, anti-vendor lock)
- process a small number of very high-value documents where quality dominates cost
Pick Firecrawl if you:
- already use it for web scraping and PDFs are an occasional aside
- don't need table structure or image extraction
- want one billing surface and accept that PDFs aren't a first-class product
## The honest caveat
No benchmark matches a production workload. If your corpus is dominated by Thai scanned legal documents or financial statements with irregular table geometries, the ranking flips. The only right play is: run all three on your documents, measure your downstream task (retrieval precision, eval score), and decide from evidence.
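One cheap way to "measure your downstream task" before building a full retrieval eval: check what fraction of known key facts from each source PDF survives conversion. Everything here — the facts list, the outputs — is hypothetical scaffolding; the point is that the comparison should be scored on your documents, not ours.

```python
# Minimal fact-survival check across converter outputs.
# The facts and sample outputs below are hypothetical.
def fact_recall(markdown_text: str, facts: list[str]) -> float:
    """Fraction of expected facts present in the converted Markdown
    (case-insensitive substring match)."""
    text = markdown_text.lower()
    return sum(f.lower() in text for f in facts) / len(facts)

facts = ["multi-head attention", "positional encoding", "BLEU 28.4"]
outputs = {
    "tool_a": "... Multi-Head Attention ... positional encoding ... BLEU 28.4 ...",
    "tool_b": "... multihead attention ... BLEU 28.4 ...",  # mangled hyphen, lost a fact
}
for tool, md in outputs.items():
    print(tool, round(fact_recall(md, facts), 2))
```

Substring matching is crude — a real eval would chunk, embed, and measure retrieval precision — but even this level catches catastrophic losses like dropped tables or garbled multi-column text.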
The harness for Ilmenite's side of this is open: `cargo bench --bench pdf_benchmark -- <your-pdf-dir>`. Add your own PDFs to `tests/fixtures/pdfs/` and get the same JSON results we used here. Marker's side is `scripts/bench/run_marker.py`, which works if you have the patience and hardware. Firecrawl's is `scripts/bench/run_firecrawl.py` if you have an API key.
## What to do next
If you're in the "PDFs are breaking my RAG" pit today, the fastest unblock is: sign up, hit `/v1/pdf/extract` with a URL, see if the Markdown works for your downstream task, and iterate. Your first $5 of credit is on the house. If you're running PDF volume at scale and want per-feature billing modeled on your corpus, get in touch — we'll run it for you.
For the architecture deep-dive — why pdfium, why geometric table detection, why pure-Rust OCR — see the PDF engine launch post.