Tutorial · April 18, 2026 · 7 min · Ilmenite Team

LangChain PDF Loader Alternatives — What to Use in 2026

LangChain ships several built-in PDF loaders: PyPDFLoader, UnstructuredPDFLoader, PyMuPDFLoader, PDFPlumberLoader, PDFMinerLoader, MathpixPDFLoader, AmazonTextractPDFLoader. They're not all good. Two of them will destroy your RAG pipeline in ways you won't notice until a user complains. This post ranks them, shows when each one actually works, and covers the cases where the right answer is to skip LangChain's built-ins and call a PDF-native API instead.

The ranked list (short version)

If you want the answer without the reasoning:

| Tier | Loader | When to use |
|------|--------|-------------|
| S | PyMuPDFLoader | Default for born-digital PDFs. Fast, decent structure. |
| S | Ilmenite /v1/pdf/extract via custom loader | When you need structure, tables, images, and OCR in one call. |
| A | AmazonTextractPDFLoader | Scanned corpus at scale, and you're on AWS anyway. |
| A | MathpixPDFLoader | Math-heavy scientific PDFs, if you're paying for Mathpix anyway. |
| B | UnstructuredPDFLoader (local, hi_res mode) | Medium scale, mixed corpus, and you can tolerate the install. |
| B | PDFPlumberLoader | Tables-first work where the PDF has ruled tables. |
| C | PyPDFLoader | Prototyping only. Not for production RAG. |
| D | PDFMinerLoader | Legacy. Don't start new projects on this. |

Now the reasoning.

PyPDFLoader — avoid for production RAG

LangChain's default PDF loader wraps pypdf. pypdf is a pure-Python PDF parser with a simple text-extraction API. That simplicity is also the problem: it does minimal reading-order reconstruction and no structural detection.

What you get: concatenated text streams page by page, often with reading-order issues on multi-column documents. Tables collapse into unstructured word soup. Images are silently dropped. There's no OCR at all, so scanned PDFs come back as empty strings.

Why it stays popular: zero dependencies beyond pypdf, works in any Python environment, fastest to set up. Great for tutorials, demos, and prototypes. Bad for anything users will rely on.

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("document.pdf")
docs = loader.load()
# docs[i].page_content is raw extracted text.
# No headings, no table structure, no images.

PyMuPDFLoader — solid default for born-digital

pymupdf (historically imported as fitz) is substantially better than pypdf for structure. It does real reading-order reconstruction and decent heading detection via font-size analysis. Performance is competitive: fractions of a second per page for typical documents.

The tradeoff: pymupdf is AGPL-licensed. If that's a problem for your project's licensing story, you need to pick something else. For internal tools and commercial products with AGPL tolerance, it's the strongest of the LangChain built-ins.

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("document.pdf")
docs = loader.load()
# Text is ordered correctly. Basic structure preserved.
# Still no table structure, still no OCR, still minimal image handling.

What it still doesn't do: proper Markdown table extraction (tables come out as whitespace-aligned text), OCR on scanned pages, or image extraction-and-hosting. For any of those you need to either post-process heavily or use a different tool.
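If you want to stay on pymupdf and post-process, recent versions (1.23+) ship a heuristic table finder you can run as a second pass. A minimal sketch, assuming a local document.pdf and a pymupdf build new enough to expose find_tables and to_markdown:

```python
import pymupdf  # pip install pymupdf; import as fitz on versions before 1.24

doc = pymupdf.open("document.pdf")
for page in doc:
    finder = page.find_tables()  # heuristic detection of ruled/aligned tables
    for table in finder.tables:
        # Emit each detected table as a Markdown table, which chunks and
        # embeds far better than whitespace-aligned plain text.
        print(table.to_markdown())
```

It's heuristic, so validate it on your corpus before relying on it; borderless tables in particular can slip through.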

UnstructuredPDFLoader — good if you can afford the install

unstructured is a meaningful step up in capability. In hi_res mode it runs actual layout detection models (detectron2 or similar) and handles tables, images, and mixed content well. Output is a list of "elements" (Title, NarrativeText, Table, Image), which is more structured than the raw-text approach of the other loaders.

The tradeoff: heavyweight install. You're pulling in PyTorch, layout models, potentially a poppler/tesseract system dependency, multi-GB footprint. Slow on CPU; practical on GPU. First-run model download takes minutes. Not a drop-in "pip install and go" story.

from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader(
    "document.pdf",
    mode="elements",
    strategy="hi_res",  # use the layout model, not just text extraction
)
docs = loader.load()
# Each doc is an element: Title, NarrativeText, Table, etc.
# Table elements have structured HTML representation.

When unstructured hi_res is worth the pain: you have a corpus of 100-1000 documents where quality matters, you have GPU or are patient, and you're OK managing the environment.

When it's not: small-scale prototypes (overkill), large-scale ingestion on CPU (too slow), or deployments where the install size is a problem (serverless, edge).

AmazonTextractPDFLoader — excellent for scanned at AWS scale

If your corpus is heavy on scanned documents and you're already on AWS, Textract is genuinely good. It handles forms, tables, handwriting (to a degree), and scanned images well. It's a managed service, so there's no local install pain.

The downside: it's Amazon-only, pricier than running your own OCR (~$0.0015 per page for basic text detection, more for tables/forms), and slower than local options for small runs because of API round-trips. It's not a good fit if you're not on AWS or if you're processing fewer than ~1,000 documents a month.

from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("s3://bucket/doc.pdf")
docs = loader.load()

MathpixPDFLoader — niche but excellent for math

Mathpix is a paid service specialized for math-heavy documents. Their OCR handles LaTeX notation extraction meaningfully better than anything else on the market. If your corpus is physics papers, ML papers with dense equations, or math textbooks, and you're already paying them, this loader is the right call.

Not worth it if your documents only have occasional math. The per-page cost ($0.005-$0.02) adds up and most documents don't need that precision.
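The loader itself follows the same shape as the others. A sketch, assuming Mathpix credentials are already set in your environment (check LangChain's MathpixPDFLoader docs for the exact variable names):

```python
from langchain_community.document_loaders import MathpixPDFLoader

loader = MathpixPDFLoader("paper.pdf", processed_file_format="md")
docs = loader.load()
# Equations come back as LaTeX embedded in the Markdown text.
```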

PDFPlumberLoader — tables-first, limited otherwise

pdfplumber has the best text-based table extraction of any Python-native library. If your work is table-heavy (financial statements, data reports) and the tables have real ruled grids, this is worth trying.

Limitations: weaker on general reading order than pymupdf, no OCR, poor on whitespace-aligned tables, limited image handling. Single-purpose tool that's great for its purpose.
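Note that the LangChain loader surfaces page text, not table objects; for the tables themselves, drop down to pdfplumber directly. A sketch, assuming a local report.pdf with ruled tables:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables() returns a list of tables; each table is a list
        # of rows, and each row is a list of cell strings (or None).
        for table in page.extract_tables():
            header, *rows = table
            print(header, len(rows))
```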

When to skip LangChain's built-ins entirely

All the loaders above make the same architectural choice: the PDF→text conversion happens in your Python process, using whatever library is in your virtualenv, consuming whatever CPU/GPU you have locally. That's fine for small workloads. It breaks at scale because:

  1. In-process PDF parsing is expensive. Every loader above except Textract and Mathpix runs inside your Python process, contending with the rest of your code for memory and CPU.
  2. Managing model weights is operational pain. unstructured in hi_res mode means downloading, pinning, and updating layout models, plus GPU drivers to keep healthy.
  3. Loaders aren't designed for mixed corpora. Each one is optimized for a case; mixing them means juggling multiple code paths.
  4. Your LangChain service is now a PDF processing service. Scaling LangChain then means scaling PDF processing, which rarely aligns with the rest of your app's scaling profile.

The alternative pattern: offload PDF processing to a dedicated API that returns Markdown, then hand the Markdown to LangChain. Your LangChain service stays lean; the PDF engine scales independently; you can swap converters without touching LangChain code.
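Markdown makes that handoff cheap because structure is explicit in the text. Purely as an illustration (this is not LangChain code), heading-aware chunking over the returned Markdown fits in a few lines of dependency-free Python:

```python
def split_by_headings(markdown: str, level: int = 2) -> list[dict]:
    """Split Markdown into chunks at headings of the given level,
    keeping each section's heading as chunk metadata."""
    marker = "#" * level + " "
    sections = [{"heading": None, "lines": []}]
    for line in markdown.splitlines():
        if line.startswith(marker):
            sections.append({"heading": line[len(marker):].strip(), "lines": []})
        else:
            sections[-1]["lines"].append(line)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
        if s["heading"] is not None or any(s["lines"])
    ]

md = "# Paper\nintro text\n## Methods\ndetails\n## Results\nnumbers"
chunks = split_by_headings(md)
# Each chunk carries its section heading, so retrieval hits can cite
# "Methods" or "Results" instead of an opaque character offset.
```

LangChain's MarkdownHeaderTextSplitter is the production version of this idea; the point is that none of it requires re-parsing the PDF.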

The Ilmenite pattern — custom loader wrapping /v1/pdf/extract

Here's a clean LangChain document loader that calls Ilmenite's PDF endpoint and returns documents with Markdown content:

import httpx
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import MarkdownHeaderTextSplitter

class IlmenitePDFLoader(BaseLoader):
    """LangChain loader for the Ilmenite /v1/pdf/extract endpoint.

    Returns Markdown-formatted Documents with preserved heading structure,
    Markdown tables, and CDN-hosted image references. Pair with
    MarkdownHeaderTextSplitter for chunking that respects document structure.
    """

    def __init__(
        self,
        url: str,
        api_key: str,
        tier: str = "standard",
        timeout: float = 60.0,
    ):
        self.url = url
        self.api_key = api_key
        self.tier = tier
        self.timeout = timeout

    def load(self) -> list[Document]:
        response = httpx.post(
            "https://api.ilmenite.dev/v1/pdf/extract",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"url": self.url, "tier": self.tier},
            timeout=self.timeout,
        )
        response.raise_for_status()
        body = response.json()

        return [Document(
            page_content=body["markdown"],
            metadata={
                "source": self.url,
                "page_count": body["pdf"]["pages"],
                "title": body["pdf"].get("title"),
                "author": body["pdf"].get("author"),
                "extraction_cost_usd": body["pdf"]["billing"]["total_usd"],
                "extraction_tier": self.tier,
            },
        )]

Usage:

loader = IlmenitePDFLoader(
    url="https://arxiv.org/pdf/1706.03762",
    api_key=ILMENITE_API_KEY,
    tier="standard",
)
docs = loader.load()

# Markdown-aware chunking that respects heading hierarchy:
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(docs[0].page_content)

That's the whole integration. No local dependencies beyond httpx, no model management, no GPU, no PyTorch, no Chromium — just HTTP in, Markdown out, cost metadata attached. The PDF processing scales on our side; your LangChain service stays stateless.

For scanned tier (OCR on scanned pages), scientific tier (formulas + images + multilingual OCR), or per-feature control, pass the features object instead of a preset tier — full schema in the API docs.

Picking between Ilmenite and a local loader

Use a local loader (PyMuPDFLoader is the best of them) if:

  • Your corpus is small (<500 PDFs total), born-digital only, no tables that matter.
  • You're prototyping and can't introduce an external service yet.
  • You have a compliance requirement that PDFs never leave your environment.

Use Ilmenite if:

  • Your corpus is mixed (born-digital + scanned, tables matter, images matter).
  • You're at scale where Python PDF processing is consuming your RAG service's CPU budget.
  • You want per-feature billing so you're not paying OCR costs on text pages.
  • You want one operationally simple API surface instead of juggling pymupdf + pytesseract + pdfplumber + unstructured.

If you're still on PyPDFLoader because it was the first thing in the LangChain docs — that's the upgrade to do today. Your RAG quality will noticeably improve just from that one change. Then evaluate whether you need to go further.
