The State of PDF Parsing: What 800+ Documents and 7 Frontier LLMs Taught Us About Parser Selection

Dec 2, 2025 · 16 min read

We tested 17 PDF parsers on 800+ documents—open source, commercial APIs, and frontier LLMs. Parser accuracy varies 55+ points by domain. Gemini 3 Pro leads at 88%, but LlamaParse at $0.003/page is the sweet spot for most use cases.


There is no "best" PDF parser. The right choice depends on your documents, your budget, and whether you need structure or just text.


The Problem: Parser Selection Is an Optimization Problem

If you've ever tried to select a PDF parser for a production pipeline, you know the challenge. The landscape offers numerous options, each backed by carefully curated benchmarks that favor its own approach.

Our evaluation covered 17 parsers across three categories: open-source libraries (pypdf, pymupdf, pdfplumber, docling, marker, and others), commercial APIs (AWS Textract, Azure Document Intelligence, Google Document AI, LlamaParse), and frontier LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, and others). Each has a different philosophy, different strengths, and different failure modes. Some prioritize speed. Others prioritize accuracy. Some handle tables well but struggle with layout. Others excel at OCR but lose document structure.

Vendor benchmarks are unreliable. Every parser vendor shows favorable performance data—on their chosen test sets. Academic benchmarks exist, but they typically use synthetic documents or narrow corpora that don't reflect real business documents: scanned contracts with coffee stains, invoices with variable layouts, regulatory filings with nested tables.

Switching costs are prohibitive. Once you've integrated a parser into your pipeline, changing it means rewriting extraction logic, revalidating output formats, and retraining downstream models. Most teams pick a parser early and stick with it—even when problems emerge—because the switching cost is too high.

We built PDFbench because we needed answers for our own document intelligence work—and because we suspected the conventional wisdom was wrong.


Our Approach: How We Built PDFbench

PDFbench is not an academic benchmark. It's a practitioner's tool, designed to answer the questions that actually matter when building document pipelines.

The Corpus: 800+ Real Documents

We assembled a corpus of 800+ documents across 6 domains:

| Domain | Documents | Requires OCR |
| --- | --- | --- |
| Legal Contracts (CUAD) | 75 | No |
| Legal Templates | 108 | No |
| Invoices | 100 | No |
| HR Documents | 34 | No |
| Synthetic (test docs) | 31 | No |
| OmniDocBench (academic, mixed) | 252 | Yes |
| Total | 800+ | |

This isn't a toy dataset. The CUAD legal contracts are real SEC filings, dense with legal language and variable formatting. The invoices come from multiple vendors with wildly different layouts. The OmniDocBench documents include academic papers, financial reports, textbooks, and research documents with complex visual elements.

The split: 358 digital PDFs (text extraction) and 252 scanned documents (OCR required). Most parsers were tested on 200-360 documents each.

The Parsers: 17 Options Evaluated

We tested across three categories:

Frontier LLMs (7) — Head-to-head on 30 documents:
- Premium: GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5
- Budget: GPT-4o-mini, Gemini 2.0 Flash, Claude 3.5 Haiku, LlamaParse

Commercial APIs (4) — Tested on 200+ documents:
- AWS Textract, Azure Document Intelligence, Google Document AI, Databricks

Open Source (6+) — Tested on full corpus:
- Text-focused: pypdf, pypdfium2, pymupdf, pdfplumber, pdfminer
- Structure-aware: docling, marker

The Metrics: Why One Number Isn't Enough

PDF parsing is actually five different problems masquerading as one:

  1. Text extraction — Did we get the right characters?
  2. OCR — Can we read scanned documents?
  3. Structure recovery — Did we preserve headings, lists, sections?
  4. Table extraction — Did we get tabular data right?
  5. Output quality — Is the markdown usable for downstream tasks?

A single accuracy number hides critical distinctions.

Core Metrics We Use

| Metric | Measures | Why It Matters |
| --- | --- | --- |
| Edit similarity | Character-level text accuracy | Core extraction quality |
| ChrF++ | Character n-gram F-score | Robustness to minor errors |
| Tree Edit Distance | Document structure as AST | RAG and LLM pipeline quality |
| TEDS | Table structure accuracy | Financial and structured data |
| Pairwise Ordering | Reading sequence correctness | Multi-column coherence |

Tree Edit Distance (TED) deserves special attention. We parse both the predicted and ground-truth Markdown into Abstract Syntax Trees using the CommonMark specification. The AST represents document structure as a tree: a section header contains paragraphs, which contain inline elements; a list contains list items, which contain paragraphs. The edit distance—minimum node insertions, deletions, and renames to transform one tree into another—captures structural hallucinations that text-level metrics miss.
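To make the metric concrete, here is a minimal sketch of the computation, assuming the markdown-it-py and zss packages are installed. PDFbench's exact tree construction and normalization may differ; this illustrates the idea rather than reproducing our production implementation.

# A minimal sketch of tree similarity: parse Markdown into a CommonMark AST,
# then compare trees with Zhang-Shasha edit distance (insert/delete/rename).
# Assumes markdown-it-py and zss; normalizing by the larger tree size is our
# illustrative convention here, not necessarily PDFbench's exact formula.
from markdown_it import MarkdownIt
from markdown_it.tree import SyntaxTreeNode
from zss import Node, simple_distance


def markdown_to_tree(md_text: str) -> Node:
    """Parse Markdown into tokens, then build a zss tree labeled by node type."""
    tokens = MarkdownIt("commonmark").parse(md_text)

    def convert(node: SyntaxTreeNode) -> Node:
        zss_node = Node(node.type)  # e.g. heading, paragraph, bullet_list, list_item
        for child in node.children:
            zss_node.addkid(convert(child))
        return zss_node

    return convert(SyntaxTreeNode(tokens))


def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(child) for child in Node.get_children(node))


def tree_similarity(predicted_md: str, reference_md: str) -> float:
    pred, ref = markdown_to_tree(predicted_md), markdown_to_tree(reference_md)
    distance = simple_distance(pred, ref)  # minimum node inserts/deletes/renames
    return 1.0 - distance / max(tree_size(pred), tree_size(ref))


# A parser that flattens a list into loose paragraphs keeps the text but loses structure.
print(tree_similarity("# Title\n\n- alpha\n- beta\n", "# Title\n\nalpha\n\nbeta\n"))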

A parser can do well on text extraction (80%) and still fail structure recovery (40%). That gap matters enormously for RAG pipelines, where document structure determines chunk boundaries and retrieval quality.


Key Finding 1: The Frontier LLM Landscape

Our benchmark reveals a nuanced quality hierarchy. On our 30-document comparison set—covering synthetic, academic, legal, invoice, and resume documents—we tested all 7 frontier models head-to-head:

| Category | Parser | Edit Similarity | ChrF++ | Cost/Doc | Tree Sim |
| --- | --- | --- | --- | --- | --- |
| Premium LLM | Gemini 3 Pro | 88% | 77% | $0.010 | 42% |
| Premium LLM | GPT-5.1 | 84% | 79% | $0.036 | 42% |
| Premium LLM | Claude Sonnet 4.5 | 78% | 77% | $0.058 | 38% |
| Budget LLM | LlamaParse | 78% | 81% | $0.003 | 39% |
| Budget LLM | Gemini 2.0 Flash | 78% | 78% | ~$0.001 | 35% |
| Budget LLM | GPT-4o-mini | 75% | 75% | ~$0.001 | 13% |
| Budget LLM | Claude 3.5 Haiku | 69% | 65% | ~$0.001 | 33% |

Note: 30 documents tested across 5 domains (10 synthetic, 5 academic, 5 legal, 5 invoices, 5 resumes). All parsers evaluated on identical documents for fair comparison.

Key observations:

Gemini 3 Pro leads on text extraction at 88% edit similarity, 10 points ahead of the best budget-tier options. But the gap is smaller than we expected. There's no 90%+ parser in our tests.

LlamaParse leads on robustness (ChrF++) at 81%, slightly ahead of GPT-5.1 at 79%. ChrF++ measures character n-gram overlap and is more forgiving of minor formatting differences.

GPT-4o-mini is a structure destroyer. Despite acceptable 75% text scores, it achieves only 13% tree similarity—less than half of any other parser. For RAG pipelines, this model is actively harmful.

Cost varies 60x. Claude Sonnet 4.5 costs $0.058/doc versus ~$0.001/doc for budget models. At 100,000 documents/month, that's $5,800 versus $100.
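The arithmetic is worth sketching, because it compounds quickly with volume. The per-document prices below are the measured averages from the table above (approximate for the budget models).

# Back-of-the-envelope monthly spend at different volumes, using the measured
# per-document costs from the 30-document comparison (budget figures are approximate).
cost_per_doc = {
    "Gemini 3 Pro": 0.010,
    "GPT-5.1": 0.036,
    "Claude Sonnet 4.5": 0.058,
    "LlamaParse": 0.003,
    "Gemini 2.0 Flash": 0.001,
}

for volume in (1_000, 100_000, 1_000_000):
    print(f"--- {volume:,} documents/month ---")
    for parser, cost in cost_per_doc.items():
        print(f"{parser:<18} ${volume * cost:>9,.0f}")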

The practical takeaway: Premium LLMs offer modest quality improvements (6-10 points) at substantial cost increases (10-60x). For most use cases, budget LLMs or LlamaParse deliver better value.


Key Finding 2: LlamaParse Is the Sweet Spot

LlamaParse occupies a unique position in the quality/cost landscape—and in our head-to-head tests, it actually leads on robustness metrics:

  • 81% ChrF++ — highest among all 7 parsers tested
  • 78% edit similarity — matches Gemini 2.0 Flash and Claude Sonnet 4.5
  • 39% tree similarity — solid structure preservation, well above GPT-4o-mini
  • $0.003 per page — 12x cheaper than GPT-5.1, 19x cheaper than Claude Sonnet 4.5
  • Purpose-built for PDF — not a general LLM doing PDF on the side

For most use cases, LlamaParse offers the best quality/cost ratio. It matches or exceeds premium LLM quality on robustness metrics while costing 10-20x less.

When to choose LlamaParse:
- You want premium-tier results at budget-tier cost
- Your volume is moderate (thousands to tens of thousands of pages monthly)
- You need reliable structure preservation for RAG pipelines

When to choose something else:
- You need maximum text fidelity (use Gemini 3 Pro at 88% edit similarity)
- You process millions of pages monthly (use open source)
- You need specific capabilities like table extraction (use pdfplumber)


Key Finding 3: The 55-Point Domain Gap

Parser accuracy varies by 55+ percentage points depending on document type—a gap that dwarfs the differences between parsers on any single domain.

| Domain | Best Parser | Edit Sim | Worst Parser | Edit Sim | Gap |
| --- | --- | --- | --- | --- | --- |
| Legal Contracts | Gemini Flash | 95% | Haiku | 55% | 40pt |
| Resumes | Haiku | 92% | GPT-4o-mini | 88% | 4pt |
| Synthetic | Gemini 3 Pro | 93% | GPT-4o-mini | 86% | 7pt |
| Invoices | Haiku | 80% | GPT-4o-mini | 74% | 6pt |
| Academic Papers | Gemini 3 Pro | 60% | Haiku | 8% | 52pt |

Legal contracts are easy. On CUAD legal documents, the best parsers achieve 93-95% edit similarity. Well-formatted, standard fonts, consistent layouts. Parser choice matters less here.

Academic papers are genuinely hard. ArXiv papers with equations, figures, and complex layouts challenge every parser. Even Gemini 3 Pro—the leader—achieves only 60%. Claude 3.5 Haiku collapses to 8%, essentially failing on this document type.

The 55-point spread between legal (95%) and academic (around 40%; only the best parser exceeds that) swamps the differences between parsers on any single domain.

If you're processing a homogeneous document type—say, contracts from your own legal templates—almost any parser will work. But if you're building a pipeline that handles mixed document types, parser selection becomes a portfolio decision. You may need different parsers for different document classes, or a triage system that routes documents to specialized extractors.


Key Finding 4: Academic Papers Break Every Parser

This finding was stark. Academic papers—ArXiv submissions with equations, figures, multi-column layouts, and complex formatting—challenged every parser we tested.

| Parser | Academic Edit Sim | Overall Edit Sim | Gap |
| --- | --- | --- | --- |
| Gemini 3 Pro | 60% | 88% | -28pt |
| GPT-5.1 | 39% | 84% | -45pt |
| LlamaParse | 38% | 78% | -40pt |
| Claude Sonnet 4.5 | 34% | 78% | -44pt |
| Gemini 2.0 Flash | 34% | 78% | -44pt |
| GPT-4o-mini | 34% | 75% | -41pt |
| Claude 3.5 Haiku | 8% | 69% | -61pt |

What makes academic papers hard:

Mathematical notation. LaTeX equations, subscripts, superscripts, and Greek letters don't survive most parsing pipelines. Even when text is extracted, the semantic meaning is lost.

Multi-column layouts. Two-column academic formatting confuses reading order. Parsers often interleave columns or fragment paragraphs.

Figures and captions. Charts, diagrams, and their captions are integral to academic content but poorly handled by text-focused parsers.

Dense, specialized formatting. References, footnotes, abstracts, and section hierarchies follow conventions that parsers don't recognize.

The Haiku collapse. Claude 3.5 Haiku—which performs adequately on other document types—achieves only 8% on academic papers. This isn't gradual degradation; it's near-total failure.

The implication: If your use case involves academic or technical documents, benchmark extensively before committing. The parser that works well on contracts may fail catastrophically on papers.


Key Finding 5: Text Accuracy Doesn't Mean Structure Quality

Text extraction accuracy and structure recovery are largely independent. Our 30-document comparison reveals dramatic gaps:

| Parser | Edit Similarity | Tree Similarity | Gap |
| --- | --- | --- | --- |
| GPT-5.1 | 84% | 42% | 42pt |
| Gemini 3 Pro | 88% | 42% | 46pt |
| LlamaParse | 78% | 39% | 39pt |
| Claude Sonnet 4.5 | 78% | 38% | 40pt |
| Gemini 2.0 Flash | 78% | 35% | 43pt |
| Claude 3.5 Haiku | 69% | 33% | 36pt |
| GPT-4o-mini | 75% | 13% | 62pt |

The GPT-4o-mini problem: This model scores 75% on text extraction—acceptable at first glance. But it achieves only 13% tree similarity, versus 33-42% for other parsers. It extracts text while destroying document structure. For RAG pipelines where structure determines chunk boundaries, GPT-4o-mini is actively harmful despite reasonable text scores.

Why structure matters for AI pipelines:

  • Chunk boundaries. Where does one section end and another begin? Poor structure recovery means poor chunking, which means irrelevant retrieval.
  • Heading hierarchy. What's a main section vs. a subsection? Lost hierarchy means lost context.
  • List and table integrity. Are those three items a list, or three separate paragraphs? The distinction matters for downstream processing.

The takeaway: If you're building RAG pipelines, don't rely on text accuracy benchmarks alone. Test structure recovery explicitly. GPT-4o-mini's 62-point gap between text and structure scores makes it unsuitable for most AI applications despite acceptable text extraction.


Key Finding 6: The Metric That Lies

Here's a counterintuitive result: open-source parsers score 90+ on ChrF++ but only 70s on edit similarity. How can the same parser score 12-22 points higher on one metric than on the other?

| Parser | Edit Similarity | ChrF++ | Gap |
| --- | --- | --- | --- |
| pypdfium2 | 78% | 90.4 | +12 |
| pypdf | 78% | 90.4 | +12 |
| pymupdf | 77% | 90.5 | +13 |
| pdfplumber | 70% | 91.6 | +22 |

What's happening: ChrF++ measures character n-gram overlap—whether the right characters appear, regardless of order or structure. Edit similarity measures the minimum edits needed to transform output into ground truth—it penalizes reordering, missing whitespace, and structural changes.

A parser can extract all the right characters (high ChrF++) while scrambling their order or losing structure (lower edit similarity).
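A small experiment shows the effect. The snippet below assumes the rapidfuzz and sacrebleu packages are installed; PDFbench's own scoring code may differ in detail, but the direction of the gap is the point.

# Two outputs with identical characters, but the lines swapped: a reading-order
# error of the kind text extractors make on multi-column pages.
from rapidfuzz.distance import Levenshtein
from sacrebleu.metrics import CHRF

reference = "Revenue grew 12% in Q3.\nOperating costs fell 4%."
scrambled = "Operating costs fell 4%.\nRevenue grew 12% in Q3."

edit_similarity = Levenshtein.normalized_similarity(scrambled, reference)
chrf_score = CHRF(word_order=2).sentence_score(scrambled, [reference]).score  # chrF++

print(f"edit similarity: {edit_similarity:.2f}")  # drops sharply: reordering is penalized
print(f"chrF++:          {chrf_score:.1f}")       # stays high: the n-grams are all there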

The implication: Don't trust single-metric benchmarks. A parser with 90+ ChrF++ might still break your pipeline if it destroys reading order. For RAG applications, edit similarity and tree similarity matter more than character overlap.


Visualizing Trade-offs: The Pareto Front

A single leaderboard ranking hides important choices. We recommend visualizing parser performance as a Pareto front:

[Figure: Parser performance trade-offs.]

This visualization immediately reveals that "best parser" questions are incomplete. The right question is: "Which parser is best given my speed/accuracy/OCR constraints?"
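The computation itself is simple. The sketch below builds the front over cost per document, edit similarity, and ChrF++ using the figures from the 30-document comparison; the helper is illustrative, not part of PDFbench.

# A sketch of the Pareto front over (cost per doc, edit similarity, ChrF++),
# using the figures from the 30-document comparison above.
parsers = {
    "Gemini 3 Pro":      (0.010, 0.88, 0.77),
    "GPT-5.1":           (0.036, 0.84, 0.79),
    "Claude Sonnet 4.5": (0.058, 0.78, 0.77),
    "LlamaParse":        (0.003, 0.78, 0.81),
    "Gemini 2.0 Flash":  (0.001, 0.78, 0.78),
    "GPT-4o-mini":       (0.001, 0.75, 0.75),
    "Claude 3.5 Haiku":  (0.001, 0.69, 0.65),
}

def dominates(a, b):
    """True if point a is no worse than b on every axis and strictly better on one."""
    (cost_a, edit_a, chrf_a), (cost_b, edit_b, chrf_b) = a, b
    no_worse = cost_a <= cost_b and edit_a >= edit_b and chrf_a >= chrf_b
    better = cost_a < cost_b or edit_a > edit_b or chrf_a > chrf_b
    return no_worse and better

front = [
    name for name, point in parsers.items()
    if not any(dominates(other, point) for other in parsers.values() if other != point)
]
print(front)  # ['Gemini 3 Pro', 'GPT-5.1', 'LlamaParse', 'Gemini 2.0 Flash']

Everything not on the front is dominated: some other parser is no more expensive and no less accurate on either text metric.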


Decision Framework: How to Select a Parser

Based on our findings, here's a practical framework for parser selection:

Question 1: What's Your Quality Requirement?

| If You Need | Choose | Why |
| --- | --- | --- |
| Maximum text fidelity (88%+) | Gemini 3 Pro | Highest edit similarity in our tests |
| High quality (84%+) | GPT-5.1 | Strong overall, good structure |
| Best value (78-81%) | LlamaParse | Highest ChrF++, 12x cheaper than premium |
| Budget option (75-78%) | Gemini 2.0 Flash | Matches LlamaParse quality, lower cost |
| Avoid | GPT-4o-mini | Destroys structure (13% tree similarity) |

Note: No parser in our tests exceeded 88% edit similarity. Claims of 90%+ should be scrutinized.

Question 2: What's Your Budget?

| Monthly Volume | Recommended Approach | Estimated Cost |
| --- | --- | --- |
| <1,000 docs | Gemini 3 Pro or GPT-5.1 | <$40/month |
| 1,000-10,000 docs | LlamaParse | $3-30/month |
| 10,000-100,000 docs | LlamaParse or Gemini Flash | $30-100/month |
| >100,000 docs | Open source | Compute only |

Question 3: What Document Types?

| Document Type | Best Parser | Accuracy | Notes |
| --- | --- | --- | --- |
| Legal contracts | Gemini Flash | 95% | Most parsers work well |
| Resumes/HR docs | Any | 88-92% | Low variance between parsers |
| Invoices | Haiku | 80% | Moderate difficulty |
| Academic papers | Gemini 3 Pro | 60% | Hard; expect degradation |

Critical warning: If your corpus includes academic/technical papers, benchmark extensively. Claude 3.5 Haiku achieves only 8% on academic content despite 69% overall.

Question 4: Do You Need Structure for RAG?

Structure preservation varies dramatically:

| Parser | Tree Similarity | RAG Suitability |
| --- | --- | --- |
| GPT-5.1 | 42% | Good |
| Gemini 3 Pro | 42% | Good |
| LlamaParse | 39% | Good |
| Claude Sonnet 4.5 | 38% | Good |
| Gemini 2.0 Flash | 35% | Acceptable |
| Claude 3.5 Haiku | 33% | Acceptable |
| GPT-4o-mini | 13% | Avoid |

Avoid GPT-4o-mini for RAG pipelines. Its 13% tree similarity means it destroys document structure while extracting text.

Quick Reference Matrix

| Use Case | Recommended | Alternative | Avoid |
| --- | --- | --- | --- |
| Max quality | Gemini 3 Pro | GPT-5.1 | Haiku |
| Quality + cost balance | LlamaParse | Gemini 2.0 Flash | GPT-4o-mini |
| High-volume processing | Open source | Gemini Flash | Premium LLMs |
| RAG pipelines | LlamaParse | GPT-5.1 | GPT-4o-mini |
| Academic papers | Gemini 3 Pro | GPT-5.1 | Haiku |
| Legal contracts | Any | | |
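If you would rather carry this matrix in code than in a document, it reduces to a small lookup. A sketch only, using the names from the table above:

# The quick-reference matrix as a lookup table a pipeline config could carry.
# A sketch: real routing should also weigh volume, OCR needs, and latency.
QUICK_REFERENCE = {
    "max_quality":      {"recommended": "Gemini 3 Pro", "alternative": "GPT-5.1",          "avoid": "Claude 3.5 Haiku"},
    "quality_and_cost": {"recommended": "LlamaParse",   "alternative": "Gemini 2.0 Flash", "avoid": "GPT-4o-mini"},
    "high_volume":      {"recommended": "open source",  "alternative": "Gemini Flash",     "avoid": "premium LLMs"},
    "rag_pipeline":     {"recommended": "LlamaParse",   "alternative": "GPT-5.1",          "avoid": "GPT-4o-mini"},
    "academic_papers":  {"recommended": "Gemini 3 Pro", "alternative": "GPT-5.1",          "avoid": "Claude 3.5 Haiku"},
    "legal_contracts":  {"recommended": "any",          "alternative": None,               "avoid": None},
}

def recommend(use_case: str) -> str:
    return QUICK_REFERENCE[use_case]["recommended"]

print(recommend("rag_pipeline"))  # LlamaParse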

Why Not Just Use Gemini 3 Pro for Everything?

A reasonable question: if Gemini 3 Pro achieves 88% edit similarity—the highest in our tests—why consider anything else?

The quality gap is modest. Gemini 3 Pro's 88% vs LlamaParse's 78% is a 10-point gap. On robustness metrics (ChrF++), LlamaParse actually leads at 81%. The premium tier offers incremental improvement, not transformative quality.

Cost at scale. Processing 100,000 documents monthly costs ~$1,000 with Gemini 3 Pro versus ~$300 with LlamaParse. At a million documents, the gap is $10,000 versus $3,000.

Domain-specific failures. Even Gemini 3 Pro drops to 60% on academic papers. Premium pricing doesn't buy immunity from document complexity.

Latency. Frontier LLMs take 10-30 seconds per page. For real-time applications, this may be prohibitive.

Consistency. LLMs exhibit variance in structure extraction. The same document parsed twice may produce different markdown structures. For pipelines requiring deterministic output, this variance is a problem.

The bottom line: Gemini 3 Pro leads our benchmarks, but the gap to budget alternatives is smaller than marketing suggests. Match your parser to your requirements, not to leaderboard positions.


Managing Parser Complexity: The PDFsmith Approach

One pattern emerged clearly from our benchmarking work: the switching cost problem is as significant as the parser selection problem. To manage this complexity, we standardized on a unified interface.

PDFsmith is the open-source library we built for this purpose. It provides a single API to 15+ parser backends:

from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use specific backend
markdown = parse("contract.pdf", backend="pypdfium2")
markdown = parse("academic.pdf", backend="marker")
markdown = parse("tables.pdf", backend="pdfplumber")

PDFsmith doesn't improve the underlying parsers; it eliminates the friction of switching between them. When your requirements change—or when you discover that your current parser struggles with a document class—you can swap backends without rewriting integration code.
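In practice this also makes the portfolio approach from Finding 3 easy to express: classify the document, then route it to the backend that handles that class best. The class labels and backend assignments below are illustrative examples, not benchmark conclusions.

# Illustrative triage: route document classes to different PDFsmith backends.
# The class labels and backend choices are examples, not PDFbench recommendations.
from pdfsmith import parse

BACKEND_BY_CLASS = {
    "contract": "pypdfium2",   # digital-native legal text: plain extraction works
    "academic": "marker",      # equations and multi-column layouts need structure
    "invoice":  "pdfplumber",  # table-heavy documents
}

def parse_document(path: str, doc_class: str | None = None) -> str:
    backend = BACKEND_BY_CLASS.get(doc_class)
    if backend is None:
        return parse(path)               # fall back to auto-selection
    return parse(path, backend=backend)

markdown = parse_document("q3_invoice.pdf", doc_class="invoice")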


Conclusion

We set out to answer a simple question: which PDF parser should we use? We learned the question itself was flawed.

There is no "best" parser. Parser performance varies dramatically by document type, required capabilities, cost constraints, and quality dimension. The 55-point accuracy gap between easy domains (legal at 95%) and hard domains (academic at 40%) dwarfs the 10-point gap between premium and budget LLMs.

The practical lessons:

Know your documents. Before selecting a parser, understand your document portfolio. Legal contracts? Most parsers work fine. Academic papers? Even the best parser achieves only 60%. Domain determines everything.

Know your constraints. Quality, cost, speed, structure preservation—you can't optimize all of them. Decide which matter most for your use case.

Consider LlamaParse as your default. It leads on robustness (ChrF++) while costing 10-20x less than premium LLMs. For most use cases, it's the right choice.

Test structure, not just text. GPT-4o-mini scores 75% on text but only 13% on structure. For RAG pipelines, that's a deal-breaker hidden by headline metrics.

Plan for hard documents. Academic papers, complex forms, and multi-column layouts remain genuinely challenging. No parser handles them well. Build your quality controls accordingly.

Beware of inflated benchmarks. We found no parser exceeding 88% edit similarity on our diverse corpus. Claims of 90%+ should be scrutinized against your actual document types.

We built PDFbench because we needed answers for our own work. The benchmark data, methodology documentation, and PDFsmith library are available for review and evaluation.


References

Benchmarks & Evaluation Frameworks:
- Ouyang et al. (2025). "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025.
- Pfitzmann et al. (2022). "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis." arXiv:2206.01062.
- Zhong et al. (2020). "Image-Based Table Recognition: Data, Model, and Evaluation." PubTabNet. ECCV 2020.

Metrics & Methodologies:
- TEDS (Tree-Edit-Distance-Similarity): Standard for table structure evaluation, introduced with PubTabNet.
- ChrF++: Character n-gram metric for robust text comparison.

Our Work:
- Applied AI. (2025). PDFbench: PDF Parser Benchmark Suite. Open source evaluation framework.
- Applied AI. (2025). PDFsmith: Unified PDF Parser Library. Open source.


Applied AI specializes in document intelligence systems that actually work in production. If you're building document processing pipelines and want to discuss your specific requirements, [contact us].

[Figure: PDF Parsing Benchmark, accuracy by domain across 800+ documents.]