PDF→Markdown, billed per feature — what we shipped
> Draft notice: This blog post is a draft. Numbers marked `[BENCHMARK PENDING]` will be filled in from the public benchmark JSON files committed alongside the engine. Do not publish until every such marker has a real, traceable number; see memory `project_positioning_discover.md` and the recent commits sweeping fictional numbers from the blog.
What changed
We shipped a new PDF→Markdown engine inside ilmenite. It runs entirely
in Rust, replaces the legacy pdftotext + tesseract subprocess
chain with a single in-process pipeline, and bills per capability
per page instead of a flat per-page rate. New endpoints:
- `POST /v1/pdf/extract`: fetches the PDF and returns markdown + itemized billing.
- `POST /v1/pdf/estimate`: same request shape, returns a cost preview without actually extracting. Free.
How the pricing works
Most providers charge a flat credit per page for PDFs regardless of what you actually need. We charge for the capabilities that ran on each page. Base text extraction is the floor; tables, formulas, OCR, and ML layout each add their own per-page surcharge, billed only on pages where the capability actually fired.
| Capability | Per-page surcharge | Billed when |
|---|---|---|
| Base text extraction | $0.0001 | Every page |
| `tables` | +$0.0002 | Pages where ≥1 table was detected |
| `formulas` | +$0.0003 | Pages where ≥1 math region was detected |
| `images` | +$0.0001 | Pages where ≥1 image was extracted |
| `preserve_layout` | +$0.0001 | Every page in the request |
| `ocr` (auto) | +$0.0008 | Only pages the classifier flagged scanned |
| `quality` (ML) | +$0.0015 | Only pages the classifier flagged complex |
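To make the per-capability rule concrete, here is a minimal Python sketch of how an itemized bill could be computed from the table above. The function names (`page_cost`, `itemize`) and the shape of the returned dictionary are hypothetical and only illustrate the arithmetic; they are not the engine's actual billing code or response format.

```python
from decimal import Decimal

# Per-page surcharges in USD, copied from the table above.
SURCHARGES = {
    "base": Decimal("0.0001"),
    "tables": Decimal("0.0002"),
    "formulas": Decimal("0.0003"),
    "images": Decimal("0.0001"),
    "preserve_layout": Decimal("0.0001"),
    "ocr": Decimal("0.0008"),
    "quality": Decimal("0.0015"),
}


def page_cost(fired):
    """Cost of one page: the base floor plus every capability that fired on it."""
    extras = set(fired) - {"base"}  # base is always charged exactly once
    return SURCHARGES["base"] + sum((SURCHARGES[n] for n in extras), Decimal("0"))


def itemize(pages):
    """Aggregate a per-capability bill (hypothetical shape) across all pages.

    `pages` is a list of sets; each set names the capabilities that fired
    on that page. A request-wide feature such as preserve_layout would be
    present in every page's set.
    """
    bill = {"base": SURCHARGES["base"] * len(pages)}
    for fired in pages:
        for name in set(fired) - {"base"}:
            bill[name] = bill.get(name, Decimal("0")) + SURCHARGES[name]
    bill["total"] = sum(bill.values(), Decimal("0"))
    return bill
```

For example, `itemize([set(), {"tables"}, {"tables", "formulas"}])` charges base on three pages, tables on two, and formulas on one. `Decimal` is used instead of floats so that sub-cent prices add up exactly.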
You can pick a named tier (preset of features) or pass features
directly. Five tiers map to common workloads:
| Tier | Per-page | Pages per $1 | When to pick it |
|---|---|---|---|
| Light | $0.0001 | 10,000 | Plain born-digital PDFs |
| Standard (most popular) | $0.0003 | 3,333 | Reports with tables |
| Scientific | $0.0006 | 1,666 | Papers with math |
| Scanned | $0.0010 | 1,000 | Scanned docs |
| Max | $0.0025 | 400 | Maximum fidelity |
Or pass `tier: "auto"` and we route each page with a cheap classifier
(<5ms per page). On a 100-page mixed PDF (60 simple + 30 tables + 10
scanned), auto bills the sum of per-page tier costs at the Light, Standard,
and Scanned rates: 60 × $0.0001 + 30 × $0.0003 + 10 × $0.0010 = $0.025,
vs. ~$0.10–$0.20 for flat-credit competitors.
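The mixed-PDF figure can be checked with a few lines of Python. This is a sketch under one assumption: that auto routes "simple" pages to Light, "tables" pages to Standard, and "scanned" pages to Scanned, which is the mapping that reproduces $0.025. The helper names are hypothetical.

```python
from decimal import Decimal

# Per-page tier prices in USD, copied from the tier table above.
TIER_PRICE = {
    "light": Decimal("0.0001"),
    "standard": Decimal("0.0003"),
    "scientific": Decimal("0.0006"),
    "scanned": Decimal("0.0010"),
    "max": Decimal("0.0025"),
}


def auto_bill(page_tiers):
    """Sum the per-page tier price for a document routed page by page."""
    return sum((TIER_PRICE[t] for t in page_tiers), Decimal("0"))


def pages_per_dollar(tier):
    """Whole pages that $1 buys at a given tier (rounded down)."""
    return int(Decimal("1") // TIER_PRICE[tier])


# Assumed routing for the 100-page example: 60 simple, 30 tables, 10 scanned.
mixed = ["light"] * 60 + ["standard"] * 30 + ["scanned"] * 10
total = auto_bill(mixed)  # equals Decimal("0.025")
```

The same table also yields the "Pages per $1" column: `pages_per_dollar("scientific")` rounds 1,666.67 down to 1,666.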
How fast it is
We benchmarked the new engine against the legacy ilmenite path, Firecrawl's Fire-PDF, Marker, and Docling on a fixed corpus of real PDFs.
| Workload | ilmenite (auto) | ilmenite (max) | Firecrawl Fire-PDF | Marker | Docling |
|---|---|---|---|---|---|
| 100-page born-digital report | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |
| 50-page scientific paper | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |
| 30-page scanned legal | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |
Source: `tests/fixtures/pdfs/` + `docs/pdf-engine/public-benchmark.md`. Every number above traces back to a JSON file in this repo. Until those files are populated, every number remains `[BENCHMARK PENDING]` and this post does not ship.
Where we lose
We're going to be honest about the failure modes too. Here are the PDFs in the corpus where Marker or Docling beat us, and why:
- `[CASE PENDING]`: we lose by `[N]%` on `[METRIC]`. Why: `[REASON]`.
- `[CASE PENDING]`: `[REASON]`.
We'd rather tell you where the seam is than pretend it isn't there. The benchmark corpus grows whenever a customer reports a PDF that fails — every reported failure becomes a permanent regression test.
Try it
curl -X POST https://api.ilmenite.dev/v1/pdf/estimate \
  -H "Authorization: Bearer $ILMENITE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/report.pdf",
    "tier": "auto"
  }'
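If you would rather call the endpoint from Python, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the curl command above; `build_estimate_request` is a hypothetical helper, and the response is deliberately not parsed because its schema is not described in this post.

```python
import json
import os
import urllib.request

API_BASE = "https://api.ilmenite.dev"


def build_estimate_request(pdf_url, tier="auto", api_key=None):
    """Build (but do not send) the POST /v1/pdf/estimate request."""
    # Fall back to the same environment variable the curl example uses.
    api_key = api_key or os.environ.get("ILMENITE_API_KEY", "")
    body = json.dumps({"url": pdf_url, "tier": tier}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/pdf/estimate",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it (requires a valid key and network access):
#   with urllib.request.urlopen(build_estimate_request(url)) as resp:
#       print(resp.read().decode("utf-8"))
```

Keeping construction separate from sending makes it easy to inspect or log the request before any network call happens.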
The estimate endpoint is free and returns the bill before you
commit. There's also a paste-and-extract tool at /dashboard/pdf.
Free tier: 5,000 PDF pages/month, no credit card.
Docs: /docs/pdf-extraction.