The State of PDF Parsing: What 800+ Documents and 7 Frontier LLMs Taught Us About Parser Selection

Dec 2, 2025 · 16 min read

We tested 17 PDF parsers on 800+ documents—open source, commercial APIs, and frontier LLMs. Parser accuracy varies 55+ points by domain. Gemini 3 Pro leads at 88%, but LlamaParse at $0.003/page is the sweet spot for most use cases.


There is no "best" PDF parser. The right choice depends on your documents, your budget, and whether you need structure or just text.


The Problem: Parser Selection Is an Optimization Problem

If you've ever tried to select a PDF parser for a production pipeline, you know the challenge. The landscape offers numerous options, each backed by carefully curated benchmarks that favor its own approach.

Our evaluation covered 17 parsers across three categories: open-source libraries (pypdf, pymupdf, pdfplumber, docling, marker, and others), commercial APIs (AWS Textract, Azure Document Intelligence, Google Document AI, LlamaParse), and frontier LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, and others). Each has a different philosophy, different strengths, and different failure modes. Some prioritize speed. Others prioritize accuracy. Some handle tables well but struggle with layout. Others excel at OCR but lose document structure.

Vendor benchmarks are unreliable. Every parser vendor shows favorable performance data—on their chosen test sets. Academic benchmarks exist, but they typically use synthetic documents or narrow corpora that don't reflect real business documents: scanned contracts with coffee stains, invoices with variable layouts, regulatory filings with nested tables.

Switching costs are prohibitive. Once you've integrated a parser into your pipeline, changing it means rewriting extraction logic, revalidating output formats, and retraining downstream models. Most teams pick a parser early and stick with it—even when problems emerge—because the switching cost is too high.

We built PDFbench because we needed answers for our own document intelligence work—and because we suspected the conventional wisdom was wrong.


Our Approach: How We Built PDFbench

PDFbench is not an academic benchmark. It's a practitioner's tool, designed to answer the questions that actually matter when building document pipelines.

The Corpus: 800+ Real Documents

We assembled a corpus of 800+ documents across 6 domains:

| Domain | Documents | Requires OCR |
| --- | --- | --- |
| Legal Contracts (CUAD) | 75 | No |
| Legal Templates | 108 | No |
| Invoices | 100 | No |
| HR Documents | 34 | No |
| Synthetic (test docs) | 31 | No |
| OmniDocBench (academic, mixed) | 252 | Yes |
| Total | 800+ | |

This isn't a toy dataset. The CUAD legal contracts are real SEC filings, dense with legal language and variable formatting. The invoices come from multiple vendors with wildly different layouts. The OmniDocBench documents include academic papers, financial reports, textbooks, and research documents with complex visual elements.

The split: 358 digital PDFs (text extraction) and 252 scanned documents (OCR required). Most parsers were tested on 200-360 documents each.

The Parsers: 17 Options Evaluated

We tested across three categories:

Frontier LLMs (7) — Head-to-head on 30 documents:
- Premium: GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5
- Budget: GPT-4o-mini, Gemini 2.0 Flash, Claude 3.5 Haiku, LlamaParse

Commercial APIs (4) — Tested on 200+ documents:
- AWS Textract, Azure Document Intelligence, Google Document AI, Databricks

Open Source (6+) — Tested on full corpus:
- Text-focused: pypdf, pypdfium2, pymupdf, pdfplumber, pdfminer
- Structure-aware: docling, marker

The Metrics: Why One Number Isn't Enough

PDF parsing is actually five different problems masquerading as one:

  1. Text extraction — Did we get the right characters?
  2. OCR — Can we read scanned documents?
  3. Structure recovery — Did we preserve headings, lists, sections?
  4. Table extraction — Did we get tabular data right?
  5. Output quality — Is the markdown usable for downstream tasks?

A single accuracy number hides critical distinctions.

Core Metrics We Use

| Metric | Measures | Why It Matters |
| --- | --- | --- |
| Edit similarity | Character-level text accuracy | Core extraction quality |
| ChrF++ | Character n-gram F-score | Robustness to minor errors |
| Tree Edit Distance | Document structure as AST | RAG and LLM pipeline quality |
| TEDS | Table structure accuracy | Financial and structured data |
| Pairwise Ordering | Reading sequence correctness | Multi-column coherence |

Tree Edit Distance (TED) deserves special attention. We parse both the predicted and ground-truth Markdown into Abstract Syntax Trees using the CommonMark specification. The AST represents document structure as a tree: a section header contains paragraphs, which contain inline elements; a list contains list items, which contain paragraphs. The edit distance—minimum node insertions, deletions, and renames to transform one tree into another—captures structural hallucinations that text-level metrics miss.
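To make the metric concrete, here is a minimal sketch of the computation, assuming the markdown-it-py and zss packages are installed. PDFbench's exact tree construction and normalization may differ; this illustrates the idea rather than reproducing our production implementation.

# A minimal sketch of tree similarity: parse Markdown into a CommonMark AST,
# then compare trees with Zhang-Shasha edit distance (insert/delete/rename).
# Assumes markdown-it-py and zss; normalizing by the larger tree size is our
# illustrative convention here, not necessarily PDFbench's exact formula.
from markdown_it import MarkdownIt
from markdown_it.tree import SyntaxTreeNode
from zss import Node, simple_distance


def markdown_to_tree(md_text: str) -> Node:
    """Parse Markdown into tokens, then build a zss tree labeled by node type."""
    tokens = MarkdownIt("commonmark").parse(md_text)

    def convert(node: SyntaxTreeNode) -> Node:
        zss_node = Node(node.type)  # e.g. heading, paragraph, bullet_list, list_item
        for child in node.children:
            zss_node.addkid(convert(child))
        return zss_node

    return convert(SyntaxTreeNode(tokens))


def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(child) for child in Node.get_children(node))


def tree_similarity(predicted_md: str, reference_md: str) -> float:
    pred, ref = markdown_to_tree(predicted_md), markdown_to_tree(reference_md)
    distance = simple_distance(pred, ref)  # minimum node inserts/deletes/renames
    return 1.0 - distance / max(tree_size(pred), tree_size(ref))


# A parser that flattens a list into loose paragraphs keeps the text but loses structure.
print(tree_similarity("# Title\n\n- alpha\n- beta\n", "# Title\n\nalpha\n\nbeta\n"))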

A parser can do well on text extraction (80%) and still fail structure recovery (40%). That gap matters enormously for RAG pipelines, where document structure determines chunk boundaries and retrieval quality.


Key Finding 1: The Frontier LLM Landscape

Our benchmark reveals a nuanced quality hierarchy. On our 30-document comparison set—covering synthetic, academic, legal, invoice, and resume documents—we tested all 7 frontier models head-to-head:

| Category | Parser | Edit Similarity | ChrF++ | Cost/Doc | Tree Sim |
| --- | --- | --- | --- | --- | --- |
| Premium LLM | Gemini 3 Pro | 88% | 77% | $0.010 | 42% |
| Premium LLM | GPT-5.1 | 84% | 79% | $0.036 | 42% |
| Premium LLM | Claude Sonnet 4.5 | 78% | 77% | $0.058 | 38% |
| Budget LLM | LlamaParse | 78% | 81% | $0.003 | 39% |
| Budget LLM | Gemini 2.0 Flash | 78% | 78% | ~$0.001 | 35% |
| Budget LLM | GPT-4o-mini | 75% | 75% | ~$0.001 | 13% |
| Budget LLM | Claude 3.5 Haiku | 69% | 65% | ~$0.001 | 33% |

Note: 30 documents tested across 5 domains (10 synthetic, 5 academic, 5 legal, 5 invoices, 5 resumes). All parsers evaluated on identical documents for fair comparison.

Key observations:

Gemini 3 Pro leads on text extraction at 88% edit similarity, 10 points ahead of the best budget-tier options. But the gap is smaller than we expected. There's no 90%+ parser in our tests.

LlamaParse leads on robustness (ChrF++) at 81%, slightly ahead of GPT-5.1 at 79%. ChrF++ measures character n-gram overlap and is more forgiving of minor formatting differences.

GPT-4o-mini is a structure destroyer. Despite acceptable 75% text scores, it achieves only 13% tree similarity—less than half of any other parser. For RAG pipelines, this model is actively harmful.

Cost varies 60x. Claude Sonnet 4.5 costs $0.058/doc versus ~$0.001/doc for budget models. At 100,000 documents/month, that's $5,800 versus $100.
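The arithmetic is worth sketching, because it compounds quickly with volume. The per-document prices below are the measured averages from the table above (approximate for the budget models).

# Back-of-the-envelope monthly spend at different volumes, using the measured
# per-document costs from the 30-document comparison (budget figures are approximate).
cost_per_doc = {
    "Gemini 3 Pro": 0.010,
    "GPT-5.1": 0.036,
    "Claude Sonnet 4.5": 0.058,
    "LlamaParse": 0.003,
    "Gemini 2.0 Flash": 0.001,
}

for volume in (1_000, 100_000, 1_000_000):
    print(f"--- {volume:,} documents/month ---")
    for parser, cost in cost_per_doc.items():
        print(f"{parser:<18} ${volume * cost:>9,.0f}")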

The practical takeaway: Premium LLMs offer modest quality improvements (6-10 points) at substantial cost increases (10-60x). For most use cases, budget LLMs or LlamaParse deliver better value.


Key Finding 2: LlamaParse Is the Sweet Spot

LlamaParse occupies a unique position in the quality/cost landscape—and in our head-to-head tests, it actually leads on robustness metrics:

  • 81% ChrF++ — highest among all 7 parsers tested
  • 78% edit similarity — matches Gemini 2.0 Flash and Claude Sonnet 4.5
  • 39% tree similarity — solid structure preservation, well above GPT-4o-mini
  • $0.003 per page — 12x cheaper than GPT-5.1, 19x cheaper than Claude Sonnet 4.5
  • Purpose-built for PDF — not a general LLM doing PDF on the side

For most use cases, LlamaParse offers the best quality/cost ratio. It matches or exceeds premium LLM quality on robustness metrics while costing 10-20x less.

When to choose LlamaParse:
- You want premium-tier results at budget-tier cost
- Your volume is moderate (thousands to tens of thousands of pages monthly)
- You need reliable structure preservation for RAG pipelines

When to choose something else:
- You need maximum text fidelity (use Gemini 3 Pro at 88% edit similarity)
- You process millions of pages monthly (use open source)
- You need specific capabilities like table extraction (use pdfplumber)


Key Finding 3: The 55-Point Domain Gap

Parser accuracy varies by 55+ percentage points depending on document type—a gap that dwarfs the differences between parsers on any single domain.

| Domain | Best Parser | Edit Sim | Worst Parser | Edit Sim | Gap |
| --- | --- | --- | --- | --- | --- |
| Legal Contracts | Gemini Flash | 95% | Haiku | 55% | 40pt |
| Resumes | Haiku | 92% | GPT-4o-mini | 88% | 4pt |
| Synthetic | Gemini 3 Pro | 93% | GPT-4o-mini | 86% | 7pt |
| Invoices | Haiku | 80% | GPT-4o-mini | 74% | 6pt |
| Academic Papers | Gemini 3 Pro | 60% | Haiku | 8% | 52pt |

Legal contracts are easy. On CUAD legal documents, the best parsers achieve 93-95% edit similarity. Well-formatted, standard fonts, consistent layouts. Parser choice matters less here.

Academic papers are genuinely hard. ArXiv papers with equations, figures, and complex layouts challenge every parser. Even Gemini 3 Pro—the leader—achieves only 60%. Claude 3.5 Haiku collapses to 8%, essentially failing on this document type.

The 55-point spread between legal (95%) and academic (around 40%; only the best parser exceeds that) swamps the differences between parsers on any single domain.

If you're processing a homogeneous document type—say, contracts from your own legal templates—almost any parser will work. But if you're building a pipeline that handles mixed document types, parser selection becomes a portfolio decision. You may need different parsers for different document classes, or a triage system that routes documents to specialized extractors.


Key Finding 4: Academic Papers Break Every Parser

This finding was stark. Academic papers—ArXiv submissions with equations, figures, multi-column layouts, and complex formatting—challenged every parser we tested.

| Parser | Academic Edit Sim | Overall Edit Sim | Gap |
| --- | --- | --- | --- |
| Gemini 3 Pro | 60% | 88% | -28pt |
| GPT-5.1 | 39% | 84% | -45pt |
| LlamaParse | 38% | 78% | -40pt |
| Claude Sonnet 4.5 | 34% | 78% | -44pt |
| Gemini 2.0 Flash | 34% | 78% | -44pt |
| GPT-4o-mini | 34% | 75% | -41pt |
| Claude 3.5 Haiku | 8% | 69% | -61pt |

What makes academic papers hard:

Mathematical notation. LaTeX equations, subscripts, superscripts, and Greek letters don't survive most parsing pipelines. Even when text is extracted, the semantic meaning is lost.

Multi-column layouts. Two-column academic formatting confuses reading order. Parsers often interleave columns or fragment paragraphs.

Figures and captions. Charts, diagrams, and their captions are integral to academic content but poorly handled by text-focused parsers.

Dense, specialized formatting. References, footnotes, abstracts, and section hierarchies follow conventions that parsers don't recognize.

The Haiku collapse. Claude 3.5 Haiku—which performs adequately on other document types—achieves only 8% on academic papers. This isn't gradual degradation; it's near-total failure.

The implication: If your use case involves academic or technical documents, benchmark extensively before committing. The parser that works well on contracts may fail catastrophically on papers.


Key Finding 5: Text Accuracy Doesn't Mean Structure Quality

Text extraction accuracy and structure recovery are largely independent. Our 30-document comparison reveals dramatic gaps:

| Parser | Edit Similarity | Tree Similarity | Gap |
| --- | --- | --- | --- |
| GPT-5.1 | 84% | 42% | 42pt |
| Gemini 3 Pro | 88% | 42% | 46pt |
| LlamaParse | 78% | 39% | 39pt |
| Claude Sonnet 4.5 | 78% | 38% | 40pt |
| Gemini 2.0 Flash | 78% | 35% | 43pt |
| Claude 3.5 Haiku | 69% | 33% | 36pt |
| GPT-4o-mini | 75% | 13% | 62pt |

The GPT-4o-mini problem: This model scores 75% on text extraction—acceptable at first glance. But it achieves only 13% tree similarity, versus 33-42% for other parsers. It extracts text while destroying document structure. For RAG pipelines where structure determines chunk boundaries, GPT-4o-mini is actively harmful despite reasonable text scores.

Why structure matters for AI pipelines:

  • Chunk boundaries. Where does one section end and another begin? Poor structure recovery means poor chunking, which means irrelevant retrieval.
  • Heading hierarchy. What's a main section vs. a subsection? Lost hierarchy means lost context.
  • List and table integrity. Are those three items a list, or three separate paragraphs? The distinction matters for downstream processing.

The takeaway: If you're building RAG pipelines, don't rely on text accuracy benchmarks alone. Test structure recovery explicitly. GPT-4o-mini's 62-point gap between text and structure scores makes it unsuitable for most AI applications despite acceptable text extraction.


Key Finding 6: The Metric That Lies

Here's a counterintuitive result: open-source parsers score 90+ on ChrF++ but only 70s on edit similarity. How can the same parser score 12-22 points higher on one metric than on the other?

| Parser | Edit Similarity | ChrF++ | Gap |
| --- | --- | --- | --- |
| pypdfium2 | 78% | 90.4 | +12 |
| pypdf | 78% | 90.4 | +12 |
| pymupdf | 77% | 90.5 | +13 |
| pdfplumber | 70% | 91.6 | +22 |

What's happening: ChrF++ measures character n-gram overlap—whether the right characters appear, regardless of order or structure. Edit similarity measures the minimum edits needed to transform output into ground truth—it penalizes reordering, missing whitespace, and structural changes.

A parser can extract all the right characters (high ChrF++) while scrambling their order or losing structure (lower edit similarity).
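A small experiment shows the effect. The snippet below assumes the rapidfuzz and sacrebleu packages are installed; PDFbench's own scoring code may differ in detail, but the direction of the gap is the point.

# Two outputs with identical characters, but the lines swapped: a reading-order
# error of the kind text extractors make on multi-column pages.
from rapidfuzz.distance import Levenshtein
from sacrebleu.metrics import CHRF

reference = "Revenue grew 12% in Q3.\nOperating costs fell 4%."
scrambled = "Operating costs fell 4%.\nRevenue grew 12% in Q3."

edit_similarity = Levenshtein.normalized_similarity(scrambled, reference)
chrf_score = CHRF(word_order=2).sentence_score(scrambled, [reference]).score  # chrF++

print(f"edit similarity: {edit_similarity:.2f}")  # drops sharply: reordering is penalized
print(f"chrF++:          {chrf_score:.1f}")       # stays high: the n-grams are all there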

The implication: Don't trust single-metric benchmarks. A parser with 90+ ChrF++ might still break your pipeline if it destroys reading order. For RAG applications, edit similarity and tree similarity matter more than character overlap.


Visualizing Trade-offs: The Pareto Front

A single leaderboard ranking hides important choices. We recommend visualizing parser performance as a Pareto front:

[Figure: Parser performance trade-offs.]

This visualization immediately reveals that "best parser" questions are incomplete. The right question is: "Which parser is best given my speed/accuracy/OCR constraints?"
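The computation itself is simple. The sketch below builds the front over cost per document, edit similarity, and ChrF++ using the figures from the 30-document comparison; the helper is illustrative, not part of PDFbench.

# A sketch of the Pareto front over (cost per doc, edit similarity, ChrF++),
# using the figures from the 30-document comparison above.
parsers = {
    "Gemini 3 Pro":      (0.010, 0.88, 0.77),
    "GPT-5.1":           (0.036, 0.84, 0.79),
    "Claude Sonnet 4.5": (0.058, 0.78, 0.77),
    "LlamaParse":        (0.003, 0.78, 0.81),
    "Gemini 2.0 Flash":  (0.001, 0.78, 0.78),
    "GPT-4o-mini":       (0.001, 0.75, 0.75),
    "Claude 3.5 Haiku":  (0.001, 0.69, 0.65),
}

def dominates(a, b):
    """True if point a is no worse than b on every axis and strictly better on one."""
    (cost_a, edit_a, chrf_a), (cost_b, edit_b, chrf_b) = a, b
    no_worse = cost_a <= cost_b and edit_a >= edit_b and chrf_a >= chrf_b
    better = cost_a < cost_b or edit_a > edit_b or chrf_a > chrf_b
    return no_worse and better

front = [
    name for name, point in parsers.items()
    if not any(dominates(other, point) for other in parsers.values() if other != point)
]
print(front)  # ['Gemini 3 Pro', 'GPT-5.1', 'LlamaParse', 'Gemini 2.0 Flash']

Everything not on the front is dominated: some other parser is no more expensive and no less accurate on either text metric.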


Decision Framework: How to Select a Parser

Based on our findings, here's a practical framework for parser selection:

Question 1: What's Your Quality Requirement?

| If You Need | Choose | Why |
| --- | --- | --- |
| Maximum text fidelity (88%+) | Gemini 3 Pro | Highest edit similarity in our tests |
| High quality (84%+) | GPT-5.1 | Strong overall, good structure |
| Best value (78-81%) | LlamaParse | Highest ChrF++, 12x cheaper than premium |
| Budget option (75-78%) | Gemini 2.0 Flash | Matches LlamaParse quality, lower cost |
| Avoid | GPT-4o-mini | Destroys structure (13% tree similarity) |

Note: No parser in our tests exceeded 88% edit similarity. Claims of 90%+ should be scrutinized.

Question 2: What's Your Budget?

| Monthly Volume | Recommended Approach | Estimated Cost |
| --- | --- | --- |
| <1,000 docs | Gemini 3 Pro or GPT-5.1 | <$40/month |
| 1,000-10,000 docs | LlamaParse | $3-30/month |
| 10,000-100,000 docs | LlamaParse or Gemini Flash | $30-100/month |
| >100,000 docs | Open source | Compute only |

Question 3: What Document Types?

| Document Type | Best Parser | Accuracy | Notes |
| --- | --- | --- | --- |
| Legal contracts | Gemini Flash | 95% | Most parsers work well |
| Resumes/HR docs | Any | 88-92% | Low variance between parsers |
| Invoices | Haiku | 80% | Moderate difficulty |
| Academic papers | Gemini 3 Pro | 60% | Hard; expect degradation |

Critical warning: If your corpus includes academic/technical papers, benchmark extensively. Claude 3.5 Haiku achieves only 8% on academic content despite 69% overall.

Question 4: Do You Need Structure for RAG?

Structure preservation varies dramatically:

| Parser | Tree Similarity | RAG Suitability |
| --- | --- | --- |
| GPT-5.1 | 42% | Good |
| Gemini 3 Pro | 42% | Good |
| LlamaParse | 39% | Good |
| Claude Sonnet 4.5 | 38% | Good |
| Gemini 2.0 Flash | 35% | Acceptable |
| Claude 3.5 Haiku | 33% | Acceptable |
| GPT-4o-mini | 13% | Avoid |

Avoid GPT-4o-mini for RAG pipelines. Its 13% tree similarity means it destroys document structure while extracting text.

Quick Reference Matrix

| Use Case | Recommended | Alternative | Avoid |
| --- | --- | --- | --- |
| Max quality | Gemini 3 Pro | GPT-5.1 | Haiku |
| Quality + cost balance | LlamaParse | Gemini 2.0 Flash | GPT-4o-mini |
| High-volume processing | Open source | Gemini Flash | Premium LLMs |
| RAG pipelines | LlamaParse | GPT-5.1 | GPT-4o-mini |
| Academic papers | Gemini 3 Pro | GPT-5.1 | Haiku |
| Legal contracts | Any | | |
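If you would rather carry this matrix in code than in a document, it reduces to a small lookup. A sketch only, using the names from the table above:

# The quick-reference matrix as a lookup table a pipeline config could carry.
# A sketch: real routing should also weigh volume, OCR needs, and latency.
QUICK_REFERENCE = {
    "max_quality":      {"recommended": "Gemini 3 Pro", "alternative": "GPT-5.1",          "avoid": "Claude 3.5 Haiku"},
    "quality_and_cost": {"recommended": "LlamaParse",   "alternative": "Gemini 2.0 Flash", "avoid": "GPT-4o-mini"},
    "high_volume":      {"recommended": "open source",  "alternative": "Gemini Flash",     "avoid": "premium LLMs"},
    "rag_pipeline":     {"recommended": "LlamaParse",   "alternative": "GPT-5.1",          "avoid": "GPT-4o-mini"},
    "academic_papers":  {"recommended": "Gemini 3 Pro", "alternative": "GPT-5.1",          "avoid": "Claude 3.5 Haiku"},
    "legal_contracts":  {"recommended": "any",          "alternative": None,               "avoid": None},
}

def recommend(use_case: str) -> str:
    return QUICK_REFERENCE[use_case]["recommended"]

print(recommend("rag_pipeline"))  # LlamaParse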

Why Not Just Use Gemini 3 Pro for Everything?

A reasonable question: if Gemini 3 Pro achieves 88% edit similarity—the highest in our tests—why consider anything else?

The quality gap is modest. Gemini 3 Pro's 88% vs LlamaParse's 78% is a 10-point gap. On robustness metrics (ChrF++), LlamaParse actually leads at 81%. The premium tier offers incremental improvement, not transformative quality.

Cost at scale. Processing 100,000 documents monthly costs ~$1,000 with Gemini 3 Pro versus ~$300 with LlamaParse. At a million documents, the gap is $10,000 versus $3,000.

Domain-specific failures. Even Gemini 3 Pro drops to 60% on academic papers. Premium pricing doesn't buy immunity from document complexity.

Latency. Frontier LLMs take 10-30 seconds per page. For real-time applications, this may be prohibitive.

Consistency. LLMs exhibit variance in structure extraction. The same document parsed twice may produce different markdown structures. For pipelines requiring deterministic output, this variance is a problem.

The bottom line: Gemini 3 Pro leads our benchmarks, but the gap to budget alternatives is smaller than marketing suggests. Match your parser to your requirements, not to leaderboard positions.


Managing Parser Complexity: The PDFsmith Approach

One pattern emerged clearly from our benchmarking work: the switching cost problem is as significant as the parser selection problem. To manage this complexity, we standardized on a unified interface.

PDFsmith is the open-source library we built for this purpose. It provides a single API to 15+ parser backends:

from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use specific backend
markdown = parse("contract.pdf", backend="pypdfium2")
markdown = parse("academic.pdf", backend="marker")
markdown = parse("tables.pdf", backend="pdfplumber")

PDFsmith doesn't improve the underlying parsers; it eliminates the friction of switching between them. When your requirements change—or when you discover that your current parser struggles with a document class—you can swap backends without rewriting integration code.
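In practice this also makes the portfolio approach from Finding 3 easy to express: classify the document, then route it to the backend that handles that class best. The class labels and backend assignments below are illustrative examples, not benchmark conclusions.

# Illustrative triage: route document classes to different PDFsmith backends.
# The class labels and backend choices are examples, not PDFbench recommendations.
from pdfsmith import parse

BACKEND_BY_CLASS = {
    "contract": "pypdfium2",   # digital-native legal text: plain extraction works
    "academic": "marker",      # equations and multi-column layouts need structure
    "invoice":  "pdfplumber",  # table-heavy documents
}

def parse_document(path: str, doc_class: str | None = None) -> str:
    backend = BACKEND_BY_CLASS.get(doc_class)
    if backend is None:
        return parse(path)               # fall back to auto-selection
    return parse(path, backend=backend)

markdown = parse_document("q3_invoice.pdf", doc_class="invoice")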


Conclusion

We set out to answer a simple question: which PDF parser should we use? We learned the question itself was flawed.

There is no "best" parser. Parser performance varies dramatically by document type, required capabilities, cost constraints, and quality dimension. The 55-point accuracy gap between easy domains (legal at 95%) and hard domains (academic at 40%) dwarfs the 10-point gap between premium and budget LLMs.

The practical lessons:

Know your documents. Before selecting a parser, understand your document portfolio. Legal contracts? Most parsers work fine. Academic papers? Even the best parser achieves only 60%. Domain determines everything.

Know your constraints. Quality, cost, speed, structure preservation—you can't optimize all of them. Decide which matter most for your use case.

Consider LlamaParse as your default. It leads on robustness (ChrF++) while costing 10-20x less than premium LLMs. For most use cases, it's the right choice.

Test structure, not just text. GPT-4o-mini scores 75% on text but only 13% on structure. For RAG pipelines, that's a deal-breaker hidden by headline metrics.

Plan for hard documents. Academic papers, complex forms, and multi-column layouts remain genuinely challenging. No parser handles them well. Build your quality controls accordingly.

Beware of inflated benchmarks. We found no parser exceeding 88% edit similarity on our diverse corpus. Claims of 90%+ should be scrutinized against your actual document types.

We built PDFbench because we needed answers for our own work. The benchmark data, methodology documentation, and PDFsmith library are available for review and evaluation.


References

Benchmarks & Evaluation Frameworks:
- Ouyang et al. (2025). "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025.
- Pfitzmann et al. (2022). "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis." arXiv:2206.01062.
- Zhong et al. (2020). "Image-Based Table Recognition: Data, Model, and Evaluation." PubTabNet. ECCV 2020.

Metrics & Methodologies:
- TEDS (Tree-Edit-Distance-Similarity): Standard for table structure evaluation, introduced with PubTabNet.
- ChrF++: Character n-gram metric for robust text comparison.

Our Work:
- Applied AI. (2025). PDFbench: PDF Parser Benchmark Suite. Open source evaluation framework.
- Applied AI. (2025). PDFsmith: Unified PDF Parser Library. Open source.


Applied AI specializes in document intelligence systems that actually work in production. If you're building document processing pipelines and want to discuss your specific requirements, [contact us].

[Figure: PDF Parsing Benchmark, accuracy by domain across 800+ documents.]