The State of PDF Parsing: What 800+ Documents Taught Us About Parser Selection
There is no "best" PDF parser. The right choice depends on your documents, your budget, and whether you need structure or just text.
The Problem: Parser Selection Is an Optimization Problem
If you've ever tried to select a PDF parser for a production pipeline, you know the challenge. The landscape offers numerous options, each with carefully curated benchmarks that favor their approach.
The choices are numerous. Our evaluation covered 17 parsers: open-source libraries (pypdf, pymupdf, pdfplumber, docling, marker, and others), commercial APIs (AWS Textract, Azure Document Intelligence, Google Document AI, LlamaParse), and frontier LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5). Each has a different philosophy, different strengths, and different failure modes. Some prioritize speed. Others prioritize accuracy. Some handle tables well but struggle with layout. Others excel at OCR but lose document structure.
Vendor benchmarks are unreliable. Every parser vendor shows favorable performance data—on their chosen test sets. Academic benchmarks exist, but they typically use synthetic documents or narrow corpora that don't reflect real business documents: scanned contracts with coffee stains, invoices with variable layouts, regulatory filings with nested tables.
Switching costs are prohibitive. Once you've integrated a parser into your pipeline, changing it means rewriting extraction logic, revalidating output formats, and retraining downstream models. Most teams pick a parser early and stick with it—even when problems emerge—because the switching cost is too high.
We built PDFbench because we needed answers for our own document intelligence work—and because we suspected the conventional wisdom was wrong.
Our Approach: How We Built PDFbench
PDFbench is not an academic benchmark. It's a practitioner's tool, designed to answer the questions that actually matter when building document pipelines.
The Corpus: 800+ Real Documents
We assembled a corpus of 800+ documents across 6 domains:
| Domain | Documents | Requires OCR |
|---|---|---|
| Legal Contracts (CUAD) | 75 | No |
| Legal Templates | 108 | No |
| Invoices | 100 | No |
| HR Documents | 34 | No |
| Synthetic (test docs) | 31 | No |
| OmniDocBench (academic, mixed) | 252 | Yes |
| Total | 800+ | — |
This isn't a toy dataset. The CUAD legal contracts are real SEC filings, dense with legal language and variable formatting. The invoices come from multiple vendors with wildly different layouts. The OmniDocBench documents include academic papers, financial reports, textbooks, and research documents with complex visual elements.
The split: 358 digital PDFs (text extraction) and 252 scanned documents (OCR required). Most parsers were tested on 200-360 documents each.
The Parsers: 17 Options Evaluated
We tested across three categories:
Frontier LLMs (7) — Head-to-head on 30 documents:
- Premium: GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5
- Budget: GPT-4o-mini, Gemini 2.0 Flash, Claude 3.5 Haiku, LlamaParse
Commercial APIs (4) — Tested on 200+ documents:
- AWS Textract, Azure Document Intelligence, Google Document AI, Databricks
Open Source (6+) — Tested on full corpus:
- Text-focused: pypdf, pypdfium2, pymupdf, pdfplumber, pdfminer
- Structure-aware: docling, marker
The Metrics: Why One Number Isn't Enough
PDF parsing is actually five different problems masquerading as one:
- Text extraction — Did we get the right characters?
- OCR — Can we read scanned documents?
- Structure recovery — Did we preserve headings, lists, sections?
- Table extraction — Did we get tabular data right?
- Output quality — Is the markdown usable for downstream tasks?
A single accuracy number hides critical distinctions.
Core Metrics We Use
| Metric | Measures | Why It Matters |
|---|---|---|
| Edit similarity | Character-level text accuracy | Core extraction quality |
| ChrF++ | Character n-gram F-score | Robustness to minor errors |
| Tree Edit Distance | Document structure as AST | RAG and LLM pipeline quality |
| TEDS | Table structure accuracy | Financial and structured data |
| Pairwise Ordering | Reading sequence correctness | Multi-column coherence |
Tree Edit Distance (TED) deserves special attention. We parse both the predicted and ground-truth Markdown into Abstract Syntax Trees using the CommonMark specification. The AST represents document structure as a tree: a section header contains paragraphs, which contain inline elements; a list contains list items, which contain paragraphs. The edit distance—minimum node insertions, deletions, and renames to transform one tree into another—captures structural hallucinations that text-level metrics miss.
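To make the metric concrete, here is a minimal sketch of the computation, assuming the zss package (a Zhang-Shasha tree edit distance implementation) and a deliberately crude Markdown-to-tree step. The actual PDFbench code parses full CommonMark ASTs and normalizes node labels, so treat this as an illustration rather than our implementation.

```python
# Minimal tree-similarity sketch, assuming the zss package for
# Zhang-Shasha tree edit distance. Not the PDFbench implementation.
from zss import Node, simple_distance

def markdown_to_tree(md: str) -> Node:
    """Build a crude document tree: headings open sections; lists and paragraphs attach to them."""
    root = Node("document")
    section = root
    for line in md.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            section = Node("heading")
            root.addkid(section)
        elif line.startswith(("-", "*")):
            section.addkid(Node("list_item"))
        else:
            section.addkid(Node("paragraph"))
    return root

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(child) for child in node.children)

def tree_similarity(predicted_md: str, truth_md: str) -> float:
    """1.0 means identical structure; lower means nodes had to be inserted, deleted, or relabeled."""
    pred, truth = markdown_to_tree(predicted_md), markdown_to_tree(truth_md)
    distance = simple_distance(pred, truth)
    return 1.0 - distance / max(tree_size(pred), tree_size(truth))

truth = "# Termination\n\nEither party may terminate.\n\n- with 30 days notice\n- for material breach\n"
pred = "Termination\n\nEither party may terminate. with 30 days notice for material breach\n"
print(f"tree similarity: {tree_similarity(pred, truth):.2f}")  # well below 1.0: the heading and list are gone
```

Even in this toy version the key property is visible: dropping a heading or flattening a list changes the tree, and therefore the score, even when every character of text survives.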
A parser can score well on text extraction (80%) while failing structure recovery (40%). That gap matters enormously for RAG pipelines, where document structure determines chunk boundaries and retrieval quality.
Key Finding 1: The Frontier LLM Landscape
Our benchmark reveals a nuanced quality hierarchy. On our 30-document comparison set—covering synthetic, academic, legal, invoice, and resume documents—we tested all 7 frontier models head-to-head:
| Category | Parser | Edit Similarity | ChrF++ | Cost/Doc | Tree Sim |
|---|---|---|---|---|---|
| Premium LLM | Gemini 3 Pro | 88% | 77% | $0.010 | 42% |
| Premium LLM | GPT-5.1 | 84% | 79% | $0.036 | 42% |
| Premium LLM | Claude Sonnet 4.5 | 78% | 77% | $0.058 | 38% |
| Budget LLM | LlamaParse | 78% | 81% | $0.003 | 39% |
| Budget LLM | Gemini 2.0 Flash | 78% | 78% | ~$0.001 | 35% |
| Budget LLM | GPT-4o-mini | 75% | 75% | ~$0.001 | 13% |
| Budget LLM | Claude 3.5 Haiku | 69% | 65% | ~$0.001 | 33% |
Note: 30 documents tested across 5 domains (10 synthetic, 5 academic, 5 legal, 5 invoices, 5 resumes). All parsers evaluated on identical documents for fair comparison.
Key observations:
Gemini 3 Pro leads on text extraction at 88% edit similarity—10 points ahead of the best budget-tier models. But the gap is smaller than we expected. There's no 90%+ parser in our tests.
LlamaParse leads on robustness (ChrF++) at 81%, slightly ahead of GPT-5.1 at 79%. ChrF++ measures character n-gram overlap and is more forgiving of minor formatting differences.
GPT-4o-mini is a structure destroyer. Despite acceptable 75% text scores, it achieves only 13% tree similarity—less than half of any other parser. For RAG pipelines, this model is actively harmful.
Cost varies 60x. Claude Sonnet 4.5 costs $0.058/doc versus ~$0.001/doc for budget models. At 100,000 documents/month, that's $5,800 versus $100.
The practical takeaway: Premium LLMs offer modest quality improvements (6-10 points) at substantial cost increases (10-60x). For most use cases, budget LLMs or LlamaParse deliver better value.
Key Finding 2: LlamaParse Is the Sweet Spot
LlamaParse occupies a unique position in the quality/cost landscape—and in our head-to-head tests, it actually leads on robustness metrics:
- 81% ChrF++ — highest among all 7 parsers tested
- 78% edit similarity — matches Gemini 2.0 Flash and Claude Sonnet 4.5
- 39% tree similarity — solid structure preservation, well above GPT-4o-mini
- $0.003 per document — 12x cheaper than GPT-5.1, 19x cheaper than Claude Sonnet 4.5
- Purpose-built for PDF — not a general LLM doing PDF on the side
For most use cases, LlamaParse offers the best quality/cost ratio. It matches or exceeds premium LLM quality on robustness metrics while costing 10-20x less.
When to choose LlamaParse:
- You want premium-tier results at budget-tier cost
- Your volume is moderate (thousands to tens of thousands of pages monthly)
- You need reliable structure preservation for RAG pipelines
When to choose something else:
- You need maximum text fidelity (use Gemini 3 Pro at 88% edit similarity)
- You process millions of pages monthly (use open source)
- You need specific capabilities like table extraction (use pdfplumber)
Key Finding 3: The 55-Point Domain Gap
Parser accuracy varies by 55+ percentage points depending on document type—a gap that dwarfs the differences between parsers on any single domain.
| Domain | Best Parser | Edit Sim | Worst Parser | Edit Sim | Gap |
|---|---|---|---|---|---|
| Legal Contracts | Gemini Flash | 95% | Haiku | 55% | 40pt |
| Resumes | Haiku | 92% | GPT-4o-mini | 88% | 4pt |
| Synthetic | Gemini 3 Pro | 93% | GPT-4o-mini | 86% | 7pt |
| Invoices | Haiku | 80% | GPT-4o-mini | 74% | 6pt |
| Academic Papers | Gemini 3 Pro | 60% | Haiku | 8% | 52pt |
Legal contracts are easy. On CUAD legal documents, the best parsers achieve 93-95% edit similarity. Well-formatted, standard fonts, consistent layouts. Parser choice matters less here.
Academic papers are genuinely hard. ArXiv papers with equations, figures, and complex layouts challenge every parser. Even Gemini 3 Pro—the leader—achieves only 60%. Claude 3.5 Haiku collapses to 8%, essentially failing on this document type.
The 55-point spread between legal (95%) and academic (40%) swamps the differences between parsers on any single domain.
If you're processing a homogeneous document type—say, contracts from your own legal templates—almost any parser will work. But if you're building a pipeline that handles mixed document types, parser selection becomes a portfolio decision. You may need different parsers for different document classes, or a triage system that routes documents to specialized extractors.
Key Finding 4: Academic Papers Break Every Parser
This finding was stark. Academic papers—ArXiv submissions with equations, figures, multi-column layouts, and complex formatting—challenged every parser we tested.
| Parser | Academic Edit Sim | Overall Edit Sim | Gap |
|---|---|---|---|
| Gemini 3 Pro | 60% | 88% | -28pt |
| GPT-5.1 | 39% | 84% | -45pt |
| LlamaParse | 38% | 78% | -40pt |
| Claude Sonnet 4.5 | 34% | 78% | -44pt |
| Gemini 2.0 Flash | 34% | 78% | -44pt |
| GPT-4o-mini | 34% | 75% | -41pt |
| Claude 3.5 Haiku | 8% | 69% | -61pt |
What makes academic papers hard:
Mathematical notation. LaTeX equations, subscripts, superscripts, and Greek letters don't survive most parsing pipelines. Even when text is extracted, the semantic meaning is lost.
Multi-column layouts. Two-column academic formatting confuses reading order. Parsers often interleave columns or fragment paragraphs.
Figures and captions. Charts, diagrams, and their captions are integral to academic content but poorly handled by text-focused parsers.
Dense, specialized formatting. References, footnotes, abstracts, and section hierarchies follow conventions that parsers don't recognize.
The Haiku collapse. Claude 3.5 Haiku—which performs adequately on other document types—achieves only 8% on academic papers. This isn't gradual degradation; it's near-total failure.
The implication: If your use case involves academic or technical documents, benchmark extensively before committing. The parser that works well on contracts may fail catastrophically on papers.
Key Finding 5: Text Accuracy Doesn't Mean Structure Quality
Text extraction accuracy and structure recovery are largely independent. Our 30-document comparison reveals dramatic gaps:
| Parser | Edit Similarity | Tree Similarity | Gap |
|---|---|---|---|
| GPT-5.1 | 84% | 42% | 42pt |
| Gemini 3 Pro | 88% | 42% | 46pt |
| LlamaParse | 78% | 39% | 39pt |
| Claude Sonnet 4.5 | 78% | 38% | 40pt |
| Gemini 2.0 Flash | 78% | 35% | 43pt |
| Claude 3.5 Haiku | 69% | 33% | 36pt |
| GPT-4o-mini | 75% | 13% | 62pt |
The GPT-4o-mini problem: This model scores 75% on text extraction—acceptable at first glance. But it achieves only 13% tree similarity, versus 33-42% for other parsers. It extracts text while destroying document structure. For RAG pipelines where structure determines chunk boundaries, GPT-4o-mini is actively harmful despite reasonable text scores.
Why structure matters for AI pipelines:
- Chunk boundaries. Where does one section end and another begin? Poor structure recovery means poor chunking, which means irrelevant retrieval (see the sketch after this list).
- Heading hierarchy. What's a main section vs. a subsection? Lost hierarchy means lost context.
- List and table integrity. Are those three items a list, or three separate paragraphs? The distinction matters for downstream processing.
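A toy sketch of the chunk-boundary point, not tied to PDFbench or any particular RAG framework: heading-aware chunking can only recover section boundaries if the parser preserved the headings in the first place.

```python
# Toy illustration: heading-aware chunking only works if the parser kept the headings.
# Not PDFbench code; just a sketch of the chunk-boundary problem.
import re

def chunk_by_heading(markdown: str) -> list[str]:
    """Split markdown into one chunk per section, using headings as boundaries."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

structured = "# Termination\nEither party may terminate...\n# Liability\nLiability is capped..."
flattened = "Termination Either party may terminate... Liability Liability is capped..."

print(len(chunk_by_heading(structured)))  # 2 chunks, one per contract section
print(len(chunk_by_heading(flattened)))   # 1 undifferentiated blob: the boundaries are gone
```

With the headings stripped, the two contract sections collapse into a single chunk, and retrieval can no longer target the clause a query is actually about.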
The takeaway: If you're building RAG pipelines, don't rely on text accuracy benchmarks alone. Test structure recovery explicitly. GPT-4o-mini's 62-point gap between text and structure scores makes it unsuitable for most AI applications despite acceptable text extraction.
Key Finding 6: The Metric That Lies
Here's a counterintuitive result: open-source parsers score 90+ on ChrF++ but only 70s on edit similarity. How can the same parser score 12-22 points higher on one metric than on the other?
| Parser | Edit Similarity | ChrF++ | Gap |
|---|---|---|---|
| pypdfium2 | 78% | 90.4 | +12 |
| pypdf | 78% | 90.4 | +12 |
| pymupdf | 77% | 90.5 | +13 |
| pdfplumber | 70% | 91.6 | +22 |
What's happening: ChrF++ measures character n-gram overlap—whether the right characters appear, regardless of order or structure. Edit similarity measures the minimum edits needed to transform output into ground truth—it penalizes reordering, missing whitespace, and structural changes.
A parser can extract all the right characters (high ChrF++) while scrambling their order or losing structure (lower edit similarity).
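A small demonstration of the divergence, assuming the rapidfuzz and sacrebleu packages for the two metrics (PDFbench's own metric code may differ in normalization details):

```python
# Why the two metrics diverge, assuming the rapidfuzz and sacrebleu packages.
# chrF++ rewards getting the right characters; edit similarity also penalizes
# putting them in the wrong order.
from rapidfuzz.distance import Levenshtein
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

reference = "Revenue grew 12% in Q3. Operating costs fell 3%."
scrambled = "Operating costs fell 3%. Revenue grew 12% in Q3."  # same characters, reordered

edit_sim = Levenshtein.normalized_similarity(scrambled, reference)  # 0-1
chrf_score = chrf_pp.sentence_score(scrambled, [reference]).score   # 0-100

print(f"edit similarity: {edit_sim:.2f}")    # drops: roughly half the text has to move
print(f"chrF++:          {chrf_score:.1f}")  # stays comparatively high: n-grams mostly overlap
```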
The implication: Don't trust single-metric benchmarks. A parser with 90+ ChrF++ might still break your pipeline if it destroys reading order. For RAG applications, edit similarity and tree similarity matter more than character overlap.
Visualizing Trade-offs: The Pareto Front
A single leaderboard ranking hides important choices. We recommend visualizing parser performance as a Pareto front:

This visualization immediately reveals that "best parser" questions are incomplete. The right question is: "Which parser is best given my speed/accuracy/OCR constraints?"
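Here is a simplified two-axis version of that chart, assuming matplotlib and using only the per-document cost and edit-similarity numbers from the frontier-LLM table above:

```python
# Simplified cost/quality Pareto view, assuming matplotlib and using the
# per-document numbers from the frontier-LLM table above.
import matplotlib.pyplot as plt

parsers = {  # name: (cost per doc in USD, edit similarity %)
    "Gemini 3 Pro": (0.010, 88), "GPT-5.1": (0.036, 84),
    "Claude Sonnet 4.5": (0.058, 78), "LlamaParse": (0.003, 78),
    "Gemini 2.0 Flash": (0.001, 78), "GPT-4o-mini": (0.001, 75),
    "Claude 3.5 Haiku": (0.001, 69),
}

def on_front(name: str) -> bool:
    """Pareto-optimal = no other parser is at least as cheap AND more accurate (or cheaper and no less accurate)."""
    cost, sim = parsers[name]
    return not any(
        (c <= cost and s > sim) or (c < cost and s >= sim)
        for other, (c, s) in parsers.items() if other != name
    )

fig, ax = plt.subplots(figsize=(7, 4))
for name, (cost, sim) in parsers.items():
    ax.scatter(cost, sim, color="tab:blue" if on_front(name) else "tab:gray")
    ax.annotate(name, (cost, sim), textcoords="offset points", xytext=(5, 3), fontsize=8)
ax.set_xscale("log")
ax.set_xlabel("Cost per document (USD, log scale)")
ax.set_ylabel("Edit similarity (%)")
ax.set_title("Cost vs. accuracy: Pareto-optimal parsers in blue")
fig.tight_layout()
plt.show()
```

On these two axes, only the points on the front are defensible picks; every gray point is beaten by something cheaper, more accurate, or both. Adding OCR support or structure quality as extra axes changes which parsers survive, which is exactly why a single leaderboard ranking misleads.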
Decision Framework: How to Select a Parser
Based on our findings, here's a practical framework for parser selection:
Question 1: What's Your Quality Requirement?
| If You Need | Choose | Why |
|---|---|---|
| Maximum text fidelity (88%+) | Gemini 3 Pro | Highest edit similarity in our tests |
| High quality (84%+) | GPT-5.1 | Strong overall, good structure |
| Best value (78-81%) | LlamaParse | Highest ChrF++, 12x cheaper than premium |
| Budget option (75-78%) | Gemini 2.0 Flash | Matches LlamaParse quality, lower cost |
| Avoid | GPT-4o-mini | Destroys structure (13% tree similarity) |
Note: No parser in our tests exceeded 88% edit similarity. Claims of 90%+ should be scrutinized.
Question 2: What's Your Budget?
| Monthly Volume | Recommended Approach | Estimated Cost |
|---|---|---|
| <1,000 docs | Gemini 3 Pro or GPT-5.1 | <$40/month |
| 1,000-10,000 docs | LlamaParse | $3-30/month |
| 10,000-100,000 docs | LlamaParse or Gemini Flash | $10-300/month |
| >100,000 docs | Open source | Compute only |
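The brackets above are just per-document cost times monthly volume. If you want to sanity-check them against your own numbers, a few lines suffice (costs taken from our benchmark tables; treat them as planning estimates, not vendor quotes):

```python
# Back-of-the-envelope monthly cost: per-document cost times volume.
# Figures are our measured averages from the benchmark tables, not vendor quotes.
COST_PER_DOC = {
    "Gemini 3 Pro": 0.010,
    "GPT-5.1": 0.036,
    "Claude Sonnet 4.5": 0.058,
    "LlamaParse": 0.003,
    "Gemini 2.0 Flash": 0.001,  # approximate (~$0.001/doc)
}

def monthly_cost(parser: str, docs_per_month: int) -> float:
    return COST_PER_DOC[parser] * docs_per_month

for name in COST_PER_DOC:
    print(f"{name:>18}: ${monthly_cost(name, 100_000):>8,.0f}/month at 100k docs")
```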
Question 3: What Document Types?
| Document Type | Best Parser | Accuracy | Notes |
|---|---|---|---|
| Legal contracts | Gemini Flash | 95% | Most parsers work well |
| Resumes/HR docs | Any | 88-92% | Low variance between parsers |
| Invoices | Haiku | 80% | Moderate difficulty |
| Academic papers | Gemini 3 Pro | 60% | Hard—expect degradation |
Critical warning: If your corpus includes academic/technical papers, benchmark extensively. Claude 3.5 Haiku achieves only 8% on academic content despite 69% overall.
Question 4: Do You Need Structure for RAG?
Structure preservation varies dramatically:
| Parser | Tree Similarity | RAG Suitability |
|---|---|---|
| GPT-5.1 | 42% | Good |
| Gemini 3 Pro | 42% | Good |
| LlamaParse | 39% | Good |
| Claude Sonnet 4.5 | 38% | Good |
| Gemini 2.0 Flash | 35% | Acceptable |
| Claude 3.5 Haiku | 33% | Acceptable |
| GPT-4o-mini | 13% | Avoid |
Avoid GPT-4o-mini for RAG pipelines. Its 13% tree similarity means it destroys document structure while extracting text.
Quick Reference Matrix
| Use Case | Recommended | Alternative | Avoid |
|---|---|---|---|
| Max quality | Gemini 3 Pro | GPT-5.1 | Haiku |
| Quality + cost balance | LlamaParse | Gemini 2.0 Flash | GPT-4o-mini |
| High-volume processing | Open source | Gemini Flash | Premium LLMs |
| RAG pipelines | LlamaParse | GPT-5.1 | GPT-4o-mini |
| Academic papers | Gemini 3 Pro | GPT-5.1 | Haiku |
| Legal contracts | Any | — | — |
Why Not Just Use Gemini 3 Pro for Everything?
A reasonable question: if Gemini 3 Pro achieves 88% edit similarity—the highest in our tests—why consider anything else?
The quality gap is modest. Gemini 3 Pro's 88% vs LlamaParse's 78% is a 10-point gap. On robustness metrics (ChrF++), LlamaParse actually leads at 81%. The premium tier offers incremental improvement, not transformative quality.
Cost at scale. Processing 100,000 documents monthly costs ~$1,000 with Gemini 3 Pro versus ~$300 with LlamaParse. At a million documents, the gap is $10,000 versus $3,000.
Domain-specific failures. Even Gemini 3 Pro drops to 60% on academic papers. Premium pricing doesn't buy immunity from document complexity.
Latency. Frontier LLMs take 10-30 seconds per page. For real-time applications, this may be prohibitive.
Consistency. LLMs exhibit variance in structure extraction. The same document parsed twice may produce different markdown structures. For pipelines requiring deterministic output, this variance is a problem.
The bottom line: Gemini 3 Pro leads our benchmarks, but the gap to budget alternatives is smaller than marketing suggests. Match your parser to your requirements, not to leaderboard positions.
Managing Parser Complexity: The PDFsmith Approach
One pattern emerged clearly from our benchmarking work: the switching cost problem is as significant as the parser selection problem. To manage this complexity, we standardized on a unified interface.
PDFsmith is the open-source library we built for this purpose. It provides a single API to 15+ parser backends:
```python
from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use specific backend
markdown = parse("contract.pdf", backend="pypdfium2")
markdown = parse("academic.pdf", backend="marker")
markdown = parse("tables.pdf", backend="pdfplumber")
```
PDFsmith doesn't improve the underlying parsers; it eliminates the friction of switching between them. When your requirements change—or when you discover that your current parser struggles with a document class—you can swap backends without rewriting integration code.
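For example, the triage pattern mentioned earlier needs nothing beyond the parse(path, backend=...) call shown above. The class-to-backend mapping below is illustrative, not something baked into PDFsmith; tune it against your own corpus:

```python
# Illustrative routing layer on top of PDFsmith's parse() call.
# The class-to-backend mapping is a hypothetical starting point; tune it on your corpus.
from pdfsmith import parse

BACKEND_BY_CLASS = {
    "contract": "pypdfium2",   # digital-native legal text: fast text extraction works well
    "academic": "marker",      # equations and multi-column layouts need a structure-aware parser
    "invoice": "pdfplumber",   # table-heavy documents
}

def parse_with_routing(path: str, doc_class: str) -> str:
    """Route a document to a class-specific backend, falling back to auto-selection."""
    backend = BACKEND_BY_CLASS.get(doc_class)
    if backend is None:
        return parse(path)             # let PDFsmith pick a backend
    return parse(path, backend=backend)

markdown = parse_with_routing("q3_invoice.pdf", "invoice")
```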
Conclusion
We set out to answer a simple question: which PDF parser should we use? We learned the question itself was flawed.
There is no "best" parser. Parser performance varies dramatically by document type, required capabilities, cost constraints, and quality dimension. The 55-point accuracy gap between easy domains (legal at 95%) and hard domains (academic at 40%) dwarfs the 10-point gap between premium and budget LLMs.
The practical lessons:
Know your documents. Before selecting a parser, understand your document portfolio. Legal contracts? Most parsers work fine. Academic papers? Even the best parser achieves only 60%. Domain determines everything.
Know your constraints. Quality, cost, speed, structure preservation—you can't optimize all of them. Decide which matter most for your use case.
Consider LlamaParse as your default. It leads on robustness (ChrF++) while costing 10-20x less than premium LLMs. For most use cases, it's the right choice.
Test structure, not just text. GPT-4o-mini scores 75% on text but only 13% on structure. For RAG pipelines, that's a deal-breaker hidden by headline metrics.
Plan for hard documents. Academic papers, complex forms, and multi-column layouts remain genuinely challenging. No parser handles them well. Build your quality controls accordingly.
Beware of inflated benchmarks. We found no parser exceeding 88% edit similarity on our diverse corpus. Claims of 90%+ should be scrutinized against your actual document types.
We built PDFbench because we needed answers for our own work. The benchmark data, methodology documentation, and PDFsmith library are available for review and evaluation.
References
Benchmarks & Evaluation Frameworks:
- Ouyang et al. (2025). "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025.
- Pfitzmann et al. (2022). "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis." arXiv:2206.01062.
- Zhong et al. (2020). "Image-Based Table Recognition: Data, Model, and Evaluation." PubTabNet. ECCV 2020.
Metrics & Methodologies:
- TEDS (Tree-Edit-Distance-Similarity): Standard for table structure evaluation, introduced with PubTabNet.
- ChrF++: Character n-gram metric for robust text comparison.
Our Work:
- Applied AI. (2025). PDFbench: PDF Parser Benchmark Suite. Open source evaluation framework.
- Applied AI. (2025). PDFsmith: Unified PDF Parser Library. Open source.
Applied AI specializes in document intelligence systems that actually work in production. If you're building document processing pipelines and want to discuss your specific requirements, [contact us].