Blog · April 15, 2026 · 3 min · Ilmenite Team

PDF→Markdown, billed per feature — what we shipped

Draft notice: This blog post is a draft. Numbers marked [BENCHMARK PENDING] will be filled in from the public benchmark JSON files committed alongside the engine. Do not publish until every such marker has a real, traceable number — see memory project_positioning_discover.md and the recent commits sweeping fictional numbers from the blog.

What changed

We shipped a new PDF→Markdown engine inside ilmenite. It runs entirely in Rust, replaces the legacy pdftotext + tesseract subprocess chain with a single in-process pipeline, and bills per capability per page instead of a flat per-page rate. New endpoints:

  • POST /v1/pdf/extract — fetches the PDF and returns markdown + itemized billing.
  • POST /v1/pdf/estimate — same request shape, returns a cost preview without actually extracting. Free.

How the pricing works

Most providers charge a flat credit per page for PDFs regardless of what you actually need. We charge for the capabilities that ran on each page. Base text extraction is the floor; tables, formulas, OCR, and ML layout each add their own per-page surcharge, billed only on pages where the capability actually fired.

| Capability | Per-page surcharge | Billed when |
| --- | --- | --- |
| Base text extraction | $0.0001 | Every page |
| tables | +$0.0002 | Pages where ≥1 table was detected |
| formulas | +$0.0003 | Pages where ≥1 math region was detected |
| images | +$0.0001 | Pages where ≥1 image was extracted |
| preserve_layout | +$0.0001 | Every page in the request |
| ocr (auto) | +$0.0008 | Only pages the classifier flagged scanned |
| quality (ML) | +$0.0015 | Only pages the classifier flagged complex |
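
To make the per-capability model concrete, here is a minimal sketch of the arithmetic in Python. The prices come straight from the table above; the function names and the three-page sample document are hypothetical, and this illustrates the pricing rule, not the engine's actual billing code.

```python
from decimal import Decimal

# Per-page prices copied from the capability table above.
BASE = Decimal("0.0001")
SURCHARGE = {
    "tables": Decimal("0.0002"),
    "formulas": Decimal("0.0003"),
    "images": Decimal("0.0001"),
    "preserve_layout": Decimal("0.0001"),
    "ocr": Decimal("0.0008"),
    "quality": Decimal("0.0015"),
}


def page_cost(fired):
    """Cost of one page: the base rate plus a surcharge for each capability that fired."""
    return BASE + sum((SURCHARGE[name] for name in fired), Decimal("0"))


def document_cost(pages):
    """Total for a document; `pages` holds one set of fired capabilities per page."""
    return sum((page_cost(fired) for fired in pages), Decimal("0"))


# Hypothetical 3-page document: plain text, a page with a table, a scanned page.
pages = [set(), {"tables"}, {"ocr"}]
total = document_cost(pages)
print(total)  # 0.0013
```

Decimal is used instead of float so that sums of four-decimal prices stay exact.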

You can pick a named tier (preset of features) or pass features directly. Five tiers map to common workloads:

| Tier | Per-page | Pages per $1 | When to pick it |
| --- | --- | --- | --- |
| Light | $0.0001 | 10,000 | Plain born-digital PDFs |
| Standard (most popular) | $0.0003 | 3,333 | Reports with tables |
| Scientific | $0.0006 | 1,666 | Papers with math |
| Scanned | $0.0010 | 1,000 | Scanned docs |
| Max | $0.0025 | 400 | Maximum fidelity |

Or pass tier: "auto" and we route per page from a cheap classifier (<5ms per page). On a 100-page mixed PDF (60 simple + 30 tables + 10 scanned), auto bills the sum of per-page tier costs — typically $0.025 vs. ~$0.10–$0.20 for flat-credit competitors.
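
The $0.025 figure can be checked by hand. The sketch below assumes the classifier routes simple pages to Light, table pages to Standard, and scanned pages to Scanned; that mapping is our reading of the example, and the variable names are illustrative only.

```python
from decimal import Decimal

# Per-page tier prices from the tier table.
TIER_PRICE = {
    "light": Decimal("0.0001"),
    "standard": Decimal("0.0003"),
    "scanned": Decimal("0.0010"),
    "max": Decimal("0.0025"),
}

# The 100-page mixed PDF from the example: page counts per assumed tier.
mix = {"light": 60, "standard": 30, "scanned": 10}

# Auto routing bills each page at its own tier price.
auto_total = sum(TIER_PRICE[tier] * count for tier, count in mix.items())

# For comparison: the same 100 pages all billed at the Max tier.
max_total = TIER_PRICE["max"] * sum(mix.values())

print(auto_total)  # 0.0250
print(max_total)   # 0.2500
```

Here 60 × $0.0001 + 30 × $0.0003 + 10 × $0.0010 = $0.006 + $0.009 + $0.010 = $0.025, a tenth of what the same document costs when every page is sent through Max.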

How fast it is

We benchmarked the new engine against the legacy ilmenite path, Firecrawl's Fire-PDF, Marker, and Docling on a fixed corpus of real PDFs.

| Workload | ilmenite (auto) | ilmenite (max) | Firecrawl Fire-PDF | Marker | Docling |
| --- | --- | --- | --- | --- | --- |
| 100-page born-digital report | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |
| 50-page scientific paper | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |
| 30-page scanned legal | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] | [BENCHMARK PENDING] |

Source: tests/fixtures/pdfs/ + docs/pdf-engine/public-benchmark.md. Every number above traces back to a JSON file in this repo. Until those files are populated, every number remains [BENCHMARK PENDING] and this post does not ship.

Where we lose

We're going to be honest about the failure modes too. Here are the PDFs in the corpus where Marker or Docling beat us, and why:

  • [CASE PENDING] — we lose by [N]% on [METRIC]. Why: [REASON].
  • [CASE PENDING][REASON].

We'd rather tell you where the seam is than pretend it isn't there. The benchmark corpus grows whenever a customer reports a PDF that fails — every reported failure becomes a permanent regression test.

Try it

curl -X POST https://api.ilmenite.dev/v1/pdf/estimate \
  -H "Authorization: Bearer $ILMENITE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/report.pdf",
    "tier": "auto"
  }'

The estimate endpoint is free and returns the bill before you commit. There's also a paste-and-extract tool at /dashboard/pdf.
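
If you would rather call the API from code, the curl request above translates to Python roughly as follows. This is a sketch using only the standard library: the helper name `build_pdf_request` is hypothetical, and the response is left unparsed because its shape is not described in this post. Since /v1/pdf/extract takes the same request shape, the same helper covers both endpoints.

```python
import json
import os
import urllib.request

API_BASE = "https://api.ilmenite.dev"


def build_pdf_request(endpoint, pdf_url, tier="auto", api_key=None):
    """Build (but do not send) a POST to /v1/pdf/<endpoint>.

    `endpoint` is "estimate" or "extract"; both accept the same JSON body.
    """
    key = api_key or os.environ.get("ILMENITE_API_KEY", "")
    body = json.dumps({"url": pdf_url, "tier": tier}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/pdf/{endpoint}",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Preview the bill first, then reuse the same arguments for the real extraction.
estimate_req = build_pdf_request("estimate", "https://example.com/report.pdf")
extract_req = build_pdf_request("extract", "https://example.com/report.pdf")

# Sending is a single call, for example:
#   with urllib.request.urlopen(estimate_req) as resp:
#       print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything is billed.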

Free tier: 5,000 PDF pages/month, no credit card.

Docs: /docs/pdf-extraction.