Engineering · April 18, 2026 · 8 min · Ilmenite Team

Why PDFs Break Your RAG Pipeline (And How to Fix It)

Every RAG project that ingests real-world documents eventually runs into the same failure mode: the retriever pulls the "right" chunk, the LLM answers confidently, and the answer is wrong. You diagnose it and find the bug is already upstream — the PDF→text step mangled the document hours before anyone asked a question. This post is a tour of the five specific ways PDFs break RAG, and what actually fixes each one.

The diagnostic pattern

You know you have a PDF problem in your RAG pipeline if any of these sound familiar:

  • "The LLM says the document doesn't mention X, but I can see X on page 4 with my own eyes."
  • "The retrieval pulls the right chunk but the answer is still wrong, as if the model is confused."
  • "Two different questions retrieve the exact same chunk, even though they're about different sections."
  • "The LLM invents numbers when asked about tables."
  • "For scanned PDFs, every query returns the same three pages — the ones where text actually extracted."

All five are symptoms of a broken PDF→text pipeline. Let's walk through what each one actually is.

Bug 1: reading order collapse

PDFs encode text as positioned glyphs, not as paragraphs. A naive extractor reads glyphs in whatever order they appear in the file's internal stream — which for multi-column layouts often means alternating columns line-by-line.

The observable effect: a two-column paper gets converted to text that interleaves columns. Your chunker splits on sentences, but the sentences span columns, so each chunk contains fragments from Section 3 and Section 4 of the paper simultaneously. Your embedder embeds these frankenchunks. Your retriever retrieves them. The LLM gets disoriented slop.

The fix: reading-order reconstruction. The extractor groups text runs into lines by y-coordinate, groups lines into columns by x-coordinate clustering, and serializes columns top-to-bottom, left-to-right. Modern geometric extractors (Ilmenite, recent pdftotext with -layout) do this. Older ones (naive PDF-to-text scripts, some early LangChain loaders) do not.
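The clustering step can be sketched in a few lines. This is a toy version that assumes a simple two-column page and hardcodes a naive mid-page split; real extractors cluster x-coordinates instead of using a fixed threshold:

```python
# Toy reading-order reconstruction over positioned text runs.
# Each run is (x, y, text); y increases down the page.
# Assumption: a two-column layout split at mid-page.

def reconstruct_reading_order(runs, page_width, line_tol=3.0):
    mid = page_width / 2
    columns = [
        [r for r in runs if r[0] < mid],   # left column
        [r for r in runs if r[0] >= mid],  # right column
    ]
    lines = []
    for col in columns:
        # Group runs into lines by y-coordinate within a tolerance.
        col.sort(key=lambda r: (r[1], r[0]))
        current_y, buf = None, []
        for x, y, text in col:
            if current_y is None or abs(y - current_y) <= line_tol:
                buf.append(text)
                current_y = y if current_y is None else current_y
            else:
                lines.append(" ".join(buf))
                buf, current_y = [text], y
        if buf:
            lines.append(" ".join(buf))
    return "\n".join(lines)

# The spliced half-sentence example from above, un-spliced:
runs = [
    (40, 100, "The model achieved"), (310, 100, "significantly outperforms"),
    (40, 112, "a BLEU score of 34.2."), (310, 112, "the baseline."),
]
print(reconstruct_reading_order(runs, page_width=600))
```

The naive stream order would interleave the two columns line by line; the reconstruction emits the whole left column before the right one.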

How to detect: grep a converted document for half-sentences. If you see "The model achieved a BLEU score of significantly outperforms the baseline" (two half-sentences spliced), you have reading-order collapse.

Bug 2: table flattening

A 5×4 financial table in a PDF has no internal structure the file format exposes. There is no <table> tag. There's just a grid of text boxes at positions, possibly with stroked lines drawn around them.

A naive extractor reads the text boxes in whatever order and concatenates them. Row and column relationships — the entire reason you have a table — vanish. "Q1 Q2 Q3 Q4 Revenue 120 135 150 175 Expenses 80 90 100 110 ProfitMargin..." is now a word salad.

For an LLM, this is a catastrophe. The model can sometimes recover from prose ambiguity by re-reading. It cannot recover from a flattened table because the pattern of cells is gone — it doesn't know which number is Q2 revenue vs. Q3 expenses.

The fix: structural table extraction that emits Markdown tables with explicit cell boundaries:

| Metric | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Revenue | 120 | 135 | 150 | 175 |
| Expenses | 80 | 90 | 100 | 110 |

Markdown tables preserve the grid, and Markdown-aware chunkers keep tables intact rather than splitting them mid-row. The LLM sees proper row/column structure and can answer row/column queries reliably.

Two ways extractors do this: geometric detection (find grid lines and reconstruct cells from their intersections — fast, but requires ruled lines) and ML detection (run a vision model trained on table-recognition datasets — catches whitespace-aligned tables, but expensive per page). Most RAG pipelines do fine with geometric detection for business documents and escalate to ML only for academic PDFs.
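The emission side is simple once the grid is reconstructed. A minimal sketch, assuming detection has already produced rows of cell strings:

```python
# Serialize a reconstructed cell grid as a Markdown table.
# Assumes table detection has already produced rows of cell strings.

def grid_to_markdown(rows):
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "---|" * len(header))  # separator row
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

rows = [
    ["Metric", "Q1", "Q2", "Q3", "Q4"],
    ["Revenue", "120", "135", "150", "175"],
    ["Expenses", "80", "90", "100", "110"],
]
print(grid_to_markdown(rows))
```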

How to detect: search your converted docs for the pattern \d+\s+\d+\s+\d+\s+\d+ (four numbers in a row). If it's common, your tables are being flattened.
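As a quick script, that check might look like:

```python
import re

# Heuristic from above: four-plus whitespace-separated numbers in a row
# inside prose usually mean a flattened table.
FLATTENED = re.compile(r"\d+\s+\d+\s+\d+\s+\d+")

sample = "Q1 Q2 Q3 Q4 Revenue 120 135 150 175 Expenses 80 90 100 110"
hits = FLATTENED.findall(sample)
print(len(hits))  # → 2
```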

Bug 3: silent OCR failure

A scanned PDF looks identical to a born-digital PDF when you open it in a viewer. Internally, it's a totally different beast: the "text" is just images of text, with no text layer at all.

If your extractor doesn't run OCR, the output for a scanned page is: nothing. Empty string. No error, no warning. The RAG pipeline dutifully indexes empty documents. Retrieval finds nothing. The user asks a question about that document. The LLM answers from training-data priors.

This is the most dangerous bug in the list because it's silent. You only discover it when a user hits a specific doc and complains.

The fix: detect scanned pages and run OCR on them. The cleanest pattern is a per-page classifier: for each page, check whether there's extractable text. If yes, extract it. If no, OCR it. If partial (a "hybrid" page with a scanned image embedded in an otherwise text page), OCR just the image region.
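A sketch of that per-page classifier, with `extract_text_layer` and `run_ocr` as hypothetical stand-ins for your extractor and OCR engine (e.g. pdfplumber and Tesseract):

```python
# Per-page text-vs-OCR classifier. `extract_text_layer` and `run_ocr`
# are hypothetical stand-ins for your extractor and OCR engine.

def page_to_text(page, extract_text_layer, run_ocr, min_chars=20):
    text = extract_text_layer(page) or ""
    if len(text.strip()) >= min_chars:
        return text           # born-digital page: trust the text layer
    ocr_text = run_ocr(page)  # scanned or near-empty page: fall back to OCR
    return ocr_text if ocr_text else text

# Demo with stub callables standing in for real engines.
pages = [{"layer": "Real extracted paragraph text goes here..."}, {"layer": ""}]
get_layer = lambda p: p["layer"]
ocr = lambda p: "OCR-recovered text"
print([page_to_text(p, get_layer, ocr) for p in pages])
```

The `min_chars` threshold is the judgment call: too low and hybrid pages slip through with only their text fragment; too high and you pay for OCR on short-but-real pages.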

Gotchas:

  • OCR is expensive (hundreds of ms per page even with fast engines). Don't run it on every page — run it where needed.
  • Non-Latin scripts need different OCR models. A Latin-only OCR on a Thai legal document returns garbage, which is arguably worse than returning nothing because the LLM will try to interpret the garbage.
  • Low-resolution scans produce low-quality OCR. If your input is 150dpi, consider upscaling before OCR.

How to detect: in your PDF metadata logger (you should have one), track "pages with extracted text" vs. "total pages." Any PDF with a gap between the two counts has had pages silently skipped.

Bug 4: footnote interleaving

Footnotes in PDFs are their own text blocks, usually at the bottom of a page. A naive extractor reads them in document order, which in PDF terms often means: right after the body text on that page. So your converted output goes:

...and the model achieved state of the art on GLUE[5].

5. We used the validation set from the original paper.

In the next section we describe...

That's a minor inconvenience for a human reader. For a chunker that splits on blank lines, it's a disaster — the footnote lands in the same chunk as the sentence before it and gets embedded as part of the "model achieved SOTA" semantic neighborhood. A query about "which validation set was used" will retrieve the chunk about SOTA results, not the footnote itself.

The fix: recognize footnotes structurally and either (a) move them to a dedicated "footnotes" section at the end of each page's output, (b) emit them as Markdown footnote syntax ([^1] ... [^1]: ...), or (c) drop them entirely for certain pipeline types (but then mark them as dropped rather than silently disappearing them).

Which option depends on your downstream task: if footnotes matter semantically (e.g., legal documents where footnotes modify the meaning of the body), option (b) preserves them in a chunk-friendly way. If footnotes are just citations, option (a) is fine.
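Option (b) is straightforward to sketch, assuming the extractor has already separated footnote blocks from body blocks (the detection itself is the hard part):

```python
# Rewrite detected footnotes into Markdown footnote syntax so chunkers
# keep the marker inline and the definition in its own block.
# Assumes footnote detection already happened upstream.

def weave_footnotes(body_blocks, footnote_blocks):
    # footnote_blocks: {number: text}; body references look like "[5]".
    out = []
    for block in body_blocks:
        for num in footnote_blocks:
            block = block.replace(f"[{num}]", f"[^{num}]")
        out.append(block)
    out.append("")  # blank line before the definitions
    for num, text in sorted(footnote_blocks.items()):
        out.append(f"[^{num}]: {text}")
    return "\n".join(out)

body = ["...achieved state of the art on GLUE[5].", "In the next section we describe..."]
notes = {5: "We used the validation set from the original paper."}
print(weave_footnotes(body, notes))
```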

How to detect: look for sentences that suddenly introduce a short, unrelated line and then resume mid-thought. Your chunker won't split those apart, so your embedder ends up mixing the two concepts.

Bug 5: image and figure loss

A figure in a PDF can carry the most important information in the document — a chart, a diagram, a screenshot of a UI. A naive extractor drops it entirely or emits ![](image_001.png) pointing at a file that doesn't exist in your storage.

For an LLM that supports vision (Claude 3.5, GPT-4o, Gemini 1.5), losing images means losing the parts of the document the model could actually interpret. For a text-only model, losing images at least means the caption is gone and any trend described in the figure is gone.

The fix: extract figures as images, upload to a CDN or object store, and emit proper Markdown image references with the caption preserved:

![Figure 3: Training loss over steps.](https://cdn.example.com/pdf/xyz/figure-3.png)

This lets downstream tools (a vision-aware LLM, a rendering step, a human reviewer) actually access the image. It also preserves the caption as text in the Markdown stream — captions are often the most semantically dense part of a figure.
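The emission step is a thin wrapper around whatever storage client you use. A sketch with a hypothetical `upload` callable standing in for your CDN or object-store client:

```python
# Emit a Markdown image reference for an extracted figure.
# `upload` is a hypothetical stand-in for your CDN/object-store client.

def figure_to_markdown(image_bytes, caption, doc_id, index, upload):
    url = upload(image_bytes, key=f"pdf/{doc_id}/figure-{index}.png")
    return f"![{caption}]({url})"

# Demo with a stub upload that just returns the hosted URL.
fake_upload = lambda data, key: f"https://cdn.example.com/{key}"
md = figure_to_markdown(b"\x89PNG...", "Figure 3: Training loss over steps.",
                        "xyz", 3, fake_upload)
print(md)
```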

Gotchas:

  • Don't emit images as base64 data URIs in the Markdown body. They blow up chunk sizes and waste embedding tokens. Use a URL reference to a CDN-hosted image.
  • Vector graphics (SVG-like content in the PDF) are harder than raster images. Most extractors rasterize them to PNG. That's acceptable but costs some fidelity.
  • Caption association (matching "Figure 3" text to the figure itself) requires reading-order plus proximity heuristics. Not all extractors do this well.

How to detect: your Markdown output has zero ![...](...) references, or the references point at nowhere-files. Either way, your figures are gone.

Putting it together: what good looks like

A PDF that converts well for a RAG pipeline produces Markdown that reads like someone typed it carefully. Specifically:

  • Headings are # / ## / ### Markdown headings in their correct hierarchy.
  • Paragraphs are separated by blank lines, reading order matches the visual layout, columns are serialized correctly.
  • Tables are Markdown tables with explicit | cell boundaries.
  • Images are ![caption](url) references pointing at real hosted files.
  • Footnotes either live in a dedicated section or use Markdown footnote syntax.
  • Scanned pages produce real OCRed text, not empty strings.
  • Math (where relevant) is $inline$ or $$display$$ LaTeX.

That's the target. The gap between "raw text dump" and "clean Markdown" is the entire difference between a RAG pipeline that works and one that silently returns garbage.

How to test any PDF converter in 10 minutes

Don't take anyone's word for which tool is best. Do this:

  1. Pick 5 documents from your corpus that represent the range (one easy, one hard, one with tables, one scanned, one with images).
  2. Convert each with the candidate tool.
  3. Open the output and grep for the five failure modes above: split sentences, flattened tables, empty pages, footnote interleaving, missing image references.
  4. Score each document 0-5 on "would this Markdown work for RAG?" Be honest.
  5. Pick the tool with the best aggregate score for your corpus. Not for generic benchmarks.
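Step 3 can be partly automated. A rough smoke-test script using the signatures from this post; these are heuristics to guide the manual review, not replace it:

```python
import re

# Grep converted Markdown for three of the failure signatures above.
FLATTENED_TABLE = re.compile(r"\d+\s+\d+\s+\d+\s+\d+")  # four numbers in a row
BROKEN_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*\)")        # image ref with no URL

def smoke_test(markdown_text):
    flags = []
    if not markdown_text.strip():
        flags.append("empty_output")  # likely silent OCR failure
    if FLATTENED_TABLE.search(markdown_text):
        flags.append("flattened_table")
    if BROKEN_IMAGE.search(markdown_text):
        flags.append("broken_image")
    return flags

print(smoke_test("Revenue 120 135 150 175 ![]()"))
# → ['flattened_table', 'broken_image']
```

Reading-order collapse and footnote interleaving resist simple regexes; for those, spot-read the output against the rendered PDF.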

If you're looking for somewhere to start, Ilmenite's PDF engine fixes all five of these by default in its Standard tier and above. The head-to-head comparison with Marker and Firecrawl has real numbers on real documents.

The general rule: if you're embedding raw text directly from a PDF into a vector store, your RAG pipeline is broken in at least three of the five ways above. You just don't know which three yet.