# Ilmenite vs Marker vs Firecrawl — PDF to Markdown for LLMs
If you're building a RAG pipeline, a fine-tuning dataset, or an agent that reads PDFs, you've hit the wall everyone hits: PDFs are hostile to LLMs. Double-column layouts shuffle into nonsense. Tables collapse into prose. Images vanish. Footnotes land mid-paragraph. The model gets garbage and you get "the document doesn't say" when the document clearly does.
So you look for a PDF→Markdown converter, and three names keep showing up: Ilmenite (disclosure: this is us), Marker (open-source, ML-first), and Firecrawl (scrape-first, PDF as a side capability). They look similar. They're not. This post benchmarks them on the same corpus and tells you which one to pick when.
We're going to be honest about what we did and didn't measure. We ran Ilmenite end-to-end against a 7-PDF corpus on a MacBook. We attempted to run Marker on the same corpus on CPU — it spent 8+ minutes on model load alone before we killed it, which is a real finding, so we're reporting it. For Marker's actual inference numbers we reference their published GPU benchmarks. For Firecrawl we use their documented pricing and behavior since we don't have an API key active on this account.
The benchmark harness and corpus are checked into the repo. You can reproduce every number here.
## The three tools, in one paragraph each
Ilmenite is a pure-Rust hosted API for web scraping and PDF→Markdown. The PDF engine uses pdfium for text extraction, a conservative geometric table detector, pure-Rust OCR (ocrs/rten) for scanned pages, and an async post-pass that uploads extracted images to R2 and swaps CDN URLs into the final Markdown. Pricing is per-feature per-page — you pay base text on every page, plus a small surcharge on pages where tables, images, or OCR actually fired. No ML in the hot path unless you opt into the Scientific or Max tier.
Marker is an open-source Python library from Datalab that runs a stack of ML models end-to-end: layout detection (Surya), OCR (Surya), table recognition (Surya Table Rec), and an LLM post-pass for tricky regions. Excellent output on complex, scientific PDFs. You run it yourself — GPU for speed, CPU if you don't mind seconds-to-minutes per page. There's a hosted version (Datalab) if you don't want to manage models.
Firecrawl is primarily a web scraping API. PDF handling is secondary: give it a PDF URL, it returns Markdown via a browser-backed pipeline. Works fine for simple documents. It does not expose per-feature toggles, per-page pricing, or the kind of table and image control you get from a PDF-native engine. Pricing is credit-based.
## Design philosophy — this is where they actually differ
| Dimension | Ilmenite | Marker | Firecrawl |
|---|---|---|---|
| Runtime | Rust + pdfium | Python + PyTorch (GPU ideal) | Chromium-backed |
| Approach | Geometric heuristics + optional OCR | End-to-end ML | Browser render + extract |
| Latency target | tens of ms/page for text | seconds/page GPU, 30-90s/page CPU | seconds/page |
| Footprint | ~150 MB binary | ~5 GB with Surya + table-rec models | hosted only |
| Pricing | per feature per page (micro-dollars) | self-hosted (free + GPU cost) / Datalab hosted | per-credit |
| PDF-specific | first-class product surface | entire product | secondary feature |
| Tables | conservative geometric detector | ML table rec (strong on irregular tables) | browser-flattened |
| Images | extracted → CDN → URL in markdown | extracted → disk | typically dropped |
| OCR | pure-Rust, classifier-gated | Surya OCR (heavy, accurate) | page-level via browser |
| Math / formulas | $...$ / $$...$$ (Scientific tier) | LaTeX via Surya + LLM post | typically lost |
Short version: Ilmenite is the pragmatic engineer's choice. Marker is the quality-at-any-cost choice. Firecrawl is "I'm already using it for scraping and PDFs are occasional".
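Ilmenite's per-feature per-page billing can be sketched as a small cost function. The rate constants below are illustrative placeholders, not the published rate card — the point is the shape of the model: every page pays base text, and surcharges apply only on pages where a feature actually fired.

```python
# Hypothetical rates in micro-dollars (1e-6 USD) per page.
# These are placeholders for illustration, NOT Ilmenite's rate card.
BASE_TEXT = 200
TABLE_SURCHARGE = 200   # only on pages where table detection fired
IMAGE_SURCHARGE = 100   # only on pages with extracted images
OCR_SURCHARGE = 500     # only on pages the scanned-page classifier flags

def page_cost(has_table=False, has_images=False, needs_ocr=False):
    """Micro-dollar cost for one page under per-feature billing."""
    cost = BASE_TEXT
    if has_table:
        cost += TABLE_SURCHARGE
    if has_images:
        cost += IMAGE_SURCHARGE
    if needs_ocr:
        cost += OCR_SURCHARGE
    return cost

print(page_cost())                 # born-digital text page: 200
print(page_cost(has_table=True))   # page with a detected table: 400
```

The practical consequence: a 100-page born-digital report pays OCR rates on zero pages, while a flat per-page or per-credit model charges as if every page might need the expensive path.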
## Benchmark setup
Corpus: 7 PDFs, all public / CC-licensed, checked in at `tests/fixtures/pdfs/`:

- `hello.pdf` — 1-page trivial control
- `arxiv-2604-12992.pdf` — 16-page multi-column arXiv paper
- `arxiv-bert.pdf` — the BERT paper (1810.04805), 16 pages
- `arxiv-transformer.pdf` — Attention Is All You Need (1706.03762), 15 pages
- `arxiv-mistral-7b.pdf` — Mistral 7B (2310.06825), 9 pages
- `arxiv-gpt3.pdf` — GPT-3 (2005.14165), 75 pages
- `irs-1040.pdf` — US IRS Form 1040, 2 pages, heavy form grid
Hardware: M-series MacBook, CPU only. That is deliberate. Most production RAG pipelines run on commodity CPUs. Numbers that assume an H100 don't describe real deployments.
Tier settings:
- Ilmenite: `Tier::Standard` (text + headings + tables + layout preservation)
- Marker: default `PdfConverter`. Attempted — killed after 8+ minutes on model load on CPU. We use their published GPU benchmarks below.
- Firecrawl: documented behavior. No live requests in this run.
## Ilmenite results — real numbers
Run with `cargo bench --bench pdf_benchmark -- tests/fixtures/pdfs standard`:
| file | pages | wall | ms/page | md size | headings | tables |
|---|---|---|---|---|---|---|
| hello.pdf | 1 | 2ms | 2 | 9 B | 0 | 0 |
| arxiv-mistral-7b.pdf | 9 | 270ms | 30 | 24.3 KB | 12 | 0 |
| arxiv-gpt3.pdf | 75 | 2.6s | 34 | 234.4 KB | 97 | 0 |
| arxiv-transformer.pdf | 15 | 646ms | 43 | 37.7 KB | 13 | 0 |
| arxiv-2604-12992.pdf | 16 | 1.3s | 82 | 48.6 KB | 12 | 0 |
| arxiv-bert.pdf | 16 | 1.3s | 82 | 61.9 KB | 4 | 0 |
| irs-1040.pdf | 2 | 1.6s | 793 | 11.5 KB | 0 | 2 |
Totals: 7/7 succeeded, 134 pages in 7.7s, average 58 ms/page. Total output 418 KB of Markdown.
Total cost at Ilmenite's published rate card: 27,200 micro-dollars = $0.0272 for all 7 PDFs, 134 pages. The 75-page GPT-3 paper costs $0.015.
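The micro-dollar accounting is easy to sanity-check from the numbers above:

```python
# Sanity-check the corpus total: 27,200 micro-dollars over 134 pages.
total_microdollars = 27_200
total_usd = total_microdollars / 1_000_000   # 1 micro-dollar = 1e-6 USD
per_page = total_microdollars / 134          # effective blended rate

print(total_usd)          # 0.0272
print(round(per_page))    # 203 micro-dollars per page
```

That ~203 µ$/page blended rate includes the two table-surcharged IRS pages; the pure-text GPT-3 paper comes out at $0.015 / 75 = 200 µ$/page.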
## What the numbers tell you
The 75-page GPT-3 paper converts in 2.6 seconds. For a RAG pipeline that's "paste the PDF URL, index the result before the user notices." A Marker run on CPU of the same PDF would take roughly 75 × 30-60s = 37-75 minutes (per Marker's own CPU guidance).
Per-page latency drops on larger PDFs (34 ms/page on GPT-3 vs. 82 ms/page on smaller papers). That's pdfium startup overhead amortizing across more pages — a real property of the engine. For bulk ingestion, you win on long documents.
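The amortization is a standard fixed-plus-marginal cost curve. The constants below are a two-point fit to the bert and gpt3 rows only, chosen to illustrate the shape — other rows (e.g. mistral-7b at 30 ms/page) deviate because per-page content complexity varies too, so don't read these as engine internals.

```python
# Toy fixed-overhead model: per-page latency = startup/pages + marginal.
# startup_ms and marginal_ms are fit to two rows of the benchmark table
# above (illustrative only, not measured engine constants).
def ms_per_page(pages, startup_ms=976.0, marginal_ms=21.0):
    return startup_ms / pages + marginal_ms

print(round(ms_per_page(16)))  # ~82 ms/page, matches the 16-page papers
print(round(ms_per_page(75)))  # ~34 ms/page, matches the GPT-3 paper
```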
The IRS form spikes to 793 ms/page. That's table detection doing real geometric work on a dense form with a full-page grid. The engine correctly emitted 2 `Block::Table` entries (one per page). Look at `results/ilmenite/irs-1040.pdf.md` to see the raw output — the table structure is preserved, though the IRS form is genuinely hard because every field is in a cell.
Zero tables detected on the arXiv papers. This is a deliberate design choice, and it's a fair limitation to call out. Ilmenite's table detector is geometry-based: it looks for ruled grid lines. arXiv papers typically use whitespace-aligned "tables" without explicit borders — the detector sees them as paragraphs. If your corpus is heavy on whitespace-table scientific papers, this is an area where Marker's ML table recognition will outperform us today. The roadmap is to add a whitespace-table detector for the Scientific tier.
## Marker — what their published numbers say
From Marker's GitHub README and paper, on GPU (A100 typical): roughly 1 page/sec for the default converter on ordinary documents, slower on heavy pages where the LLM post-pass fires. On CPU they explicitly recommend not using it for anything but small tests — which matches what we saw.
That means on commodity hardware without a GPU, Marker is not a realistic option for production PDF ingestion. On GPU, 1 page/sec is fine for offline batch processing but ~60× slower than Ilmenite's text path for the same born-digital documents.
Where Marker wins: irregular tables without ruled lines, math-heavy scientific PDFs, scanned documents with non-Latin text, and anything where the layout is adversarial. Their Surya Table Rec model catches tables Ilmenite's geometric detector misses. That's real. If your corpus is 100 IEEE papers with nested equations, Marker on a GPU is the correct choice today.
Where Marker loses: operational overhead (Python + CUDA + multi-GB model weights + GPU budget), cold-start latency, throughput on CPU, cost per page on mixed or text-heavy corpora.
## Firecrawl — from their docs
Firecrawl's `/v1/scrape` accepts PDF URLs. The pipeline renders via a browser and extracts. Expected behavior, per their documentation:
- Works: plain-text PDFs, simple layouts
- Limited: table structure (typically flattened to text), image extraction (usually dropped), math (lost)
- No per-feature toggles (can't ask for "text + tables, no images")
- Per-credit pricing at roughly $0.001-$0.003 per successful PDF scrape depending on plan
If you're already paying Firecrawl for web scraping and PDFs are an occasional thing, using their endpoint is fine. If you have real PDF volume or care about structure, use a PDF-native engine.
## Cost — worked example
"RAG pipeline ingests 1,000 arXiv-style papers, ~15 pages each" = 15,000 pages.
| Tool | Per-page cost | Total | Notes |
|---|---|---|---|
| Ilmenite | ~$0.0003 | ~$4.50 | Standard tier, tables+text |
| Marker self-hosted | GPU time | ~$3-5 | A10 at $0.75/hr × ~5hr (GPU warm) |
| Marker hosted (Datalab) | their per-page | ~$15 | at their listed $0.001/page |
| Firecrawl | ~$0.002-0.003 | ~$30-45 | credit pricing, per scrape |
Numbers to within a factor of 2; run your own for your corpus. Ilmenite and Marker self-hosted are in the same cost ballpark for born-digital text corpora, and both are an order of magnitude cheaper than the browser-based alternatives.
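The table's arithmetic, spelled out so you can swap in your own rates (the per-page figures are the same rough approximations as above, not quotes):

```python
# Worked example: 1,000 papers x ~15 pages = 15,000 pages.
# Rates are the rough per-page approximations from the table above.
pages = 1_000 * 15

ilmenite = pages * 0.0003       # Standard tier
datalab = pages * 0.001         # Marker hosted, listed per-page rate
firecrawl_low = pages * 0.002   # credit pricing, low end
firecrawl_high = pages * 0.003  # credit pricing, high end

print(round(ilmenite, 2))        # 4.5
print(round(datalab, 2))         # 15.0
print(round(firecrawl_low, 2), round(firecrawl_high, 2))  # 30.0 45.0
```

Marker self-hosted isn't a per-page rate at all — at ~1 page/sec on GPU, 15,000 pages is ~4.2 hours of A10 time, which is where the ~$3-5 figure comes from.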
## When to pick which
Pick Ilmenite if you:
- are building a RAG or agent pipeline where predictable sub-second latency matters
- have mostly born-digital PDFs (80-90% of RAG input in the wild)
- want per-feature billing so you only pay for OCR on actually-scanned pages
- don't want to run Python + CUDA + 5 GB of model weights
- want one API surface for both web pages and PDFs
Pick Marker if you:
- have GPUs available and need highest fidelity on hard PDFs (scientific/scanned/irregular tables)
- are OK with seconds per page latency
- want to run on your own hardware (compliance, data residency, anti-vendor lock)
- process a small number of very high-value documents where quality dominates cost
Pick Firecrawl if you:
- already use it for web scraping and PDFs are an occasional aside
- don't need table structure or image extraction
- want one billing surface and accept that PDFs aren't a first-class product
## The honest caveat
No benchmark matches a production workload. If your corpus is dominated by Thai scanned legal documents or financial statements with irregular table geometries, the ranking flips. The only right play is: run all three on your documents, measure your downstream task (retrieval precision, eval score), and decide from evidence.
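One cheap way to "measure your downstream task" before building a full retrieval eval: check what fraction of known key facts from each source PDF survives conversion. Everything here — the facts list, the outputs — is hypothetical scaffolding; the point is that the comparison should be scored on your documents, not ours.

```python
# Minimal fact-survival check across converter outputs.
# The facts and sample outputs below are hypothetical.
def fact_recall(markdown_text: str, facts: list[str]) -> float:
    """Fraction of expected facts present in the converted Markdown
    (case-insensitive substring match)."""
    text = markdown_text.lower()
    return sum(f.lower() in text for f in facts) / len(facts)

facts = ["multi-head attention", "positional encoding", "BLEU 28.4"]
outputs = {
    "tool_a": "... Multi-Head Attention ... positional encoding ... BLEU 28.4 ...",
    "tool_b": "... multihead attention ... BLEU 28.4 ...",  # mangled hyphen, lost a fact
}
for tool, md in outputs.items():
    print(tool, round(fact_recall(md, facts), 2))
```

Substring matching is crude — a real eval would chunk, embed, and measure retrieval precision — but even this level catches catastrophic losses like dropped tables or garbled multi-column text.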
The harness for Ilmenite's side of this is open: `cargo bench --bench pdf_benchmark -- <your-pdf-dir>`. Add your own PDFs to `tests/fixtures/pdfs/` and get the same JSON results we used here. Marker's side is `scripts/bench/run_marker.py`, which works if you have the patience and hardware. Firecrawl's is `scripts/bench/run_firecrawl.py` if you have an API key.
## What to do next
If you're in the "PDFs are breaking my RAG" pit today, the fastest unblock is: sign up, hit `/v1/pdf/extract` with a URL, see if the Markdown works for your downstream task, and iterate. Your first $5 of credit is on the house. If you're running PDF volume at scale and want per-feature billing modeled on your corpus, get in touch — we'll run it for you.
For the architecture deep-dive — why pdfium, why geometric table detection, why pure-Rust OCR — see the PDF engine launch post.