Issue #5
August 21, 2025
document-ai
6 min read
# The 45-Point Gap
Parser accuracy varies 45 percentage points or more depending on document type.
We didn't expect this. We expected some variation—maybe 10-15 points. What we found challenges the fundamental assumption behind "best parser" discussions.
---
## What We Did
We built PDFbench to answer a practical question: which PDF parser should we use?
Our document intelligence work handles diverse content—legal contracts, financial reports, regulatory filings, technical documentation. Clients ask which parser works best. Vendors provide benchmarks showing their product on top. Industry comparisons use synthetic datasets that don't reflect real documents.
So we built a benchmark using real business documents.
**The corpus: 600+ documents across 7 domains.** Legal contracts from the CUAD dataset. Invoices with variable layouts. HR documents. Academic papers. Scanned documents requiring OCR. Not toy examples—real documents with real formatting challenges.
**The parsers: 20 options evaluated.** 10 open-source (pypdf, pymupdf, pdfplumber, docling, marker, and more). 4 commercial APIs (AWS Textract, Azure Document Intelligence, Google Document AI, LlamaParse). And now: 6 frontier LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, plus budget variants).
**The metrics: 6 quality dimensions.** Text extraction accuracy. Structure recovery. Table extraction. Processing speed, among others. We measured each dimension separately because a parser can ace one and fail another.
---
## The Finding That Changed Our Thinking
The 45-point gap emerged when we broke down results by document domain:
| Domain | Best Accuracy | Worst Accuracy | Gap |
|--------|---------------|----------------|-----|
| Legal Contracts | ~95% | ~85% | 10 pts |
| ArXiv Papers | ~85% | ~60% | 25 pts |
| Invoices | ~50% | ~35% | 15 pts |
Legal contracts hit 95%+ with almost any parser. The documents are well-formatted, use standard fonts, have consistent layouts. Parser choice barely matters—they all work.
Invoices struggle to exceed 50% with any parser. Every vendor has different templates. Tables blend with prose. Key data appears in unpredictable locations.
The gap between easiest domain (legal contracts) and hardest domain (invoices) is 45 percentage points. That swamps the differences between parsers on any single domain.
**The implication is uncomfortable: "which parser is best?" is the wrong question.**
The right question is: "what documents do I have?"
---
## Why Invoices Break Everything
We expected invoices to be harder. We didn't expect them to be universally hard—no parser exceeding 50%.
The structural challenges:
**Layout variability.** Every vendor uses a different invoice template. Line items might appear in tables, prose, or hybrid formats. Totals might sit at the bottom, run down the side, or be scattered across the page. There's no "standard invoice layout" the way there's a standard contract structure.
**Semantic ambiguity.** "Net 30" appears on many invoices. Is it a payment term? Discount condition? Line item description? Parsers see characters, not meaning. Context determines interpretation, and context varies per vendor.
**Mixed content.** Invoices combine structured data (tables), semi-structured data (address blocks), and unstructured data (notes, terms). No single parsing strategy handles all three well.
This isn't a parser failure—it's an industry limitation. Invoices are genuinely hard documents. Recognizing that difficulty is the first step toward realistic expectations.
---
## The Structure Problem
Here's a finding that reframes quality assessment: text extraction accuracy and structure recovery are largely independent.
Knowing a parser extracts text accurately tells you almost nothing about whether it preserves document structure.
A parser can get 80% of characters correct while destroying the hierarchy—turning headings into body text, collapsing lists into paragraphs, losing section boundaries.
**Why this matters for AI pipelines:**
If you're feeding parsed documents into RAG systems or LLM extraction, structure is critical:
- **Chunk boundaries** depend on section detection. Poor structure means poor chunking, which means irrelevant retrieval.
- **Heading hierarchy** determines context. Is this paragraph under "Payment Terms" or "Delivery Schedule"? Lost hierarchy loses meaning.
- **Table integrity** affects extraction. Are those three items a list or three paragraphs? The distinction changes downstream processing.
Text accuracy benchmarks hide this gap. Two parsers might both score 80% on text—but one preserves structure at 55% while the other drops to 40%. For RAG applications, that 15-point structure gap matters more than equivalent text accuracy.
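To see why the two dimensions diverge, it helps to score them with separate functions. Here's a minimal sketch, assuming markdown-style parser output with `#` headings; `score_text` and `score_structure` are illustrative stand-ins for edit similarity and tree similarity, not PDFbench's actual metric code.

```python
# Illustrative stand-ins for "edit similarity" (text) and "tree similarity"
# (structure) -- not PDFbench's actual metric implementations.
from difflib import SequenceMatcher
import re

def score_text(predicted: str, reference: str) -> float:
    """Character-level similarity of the extracted text, ignoring structure."""
    return SequenceMatcher(None, predicted, reference).ratio()

def heading_outline(markdown: str) -> list[tuple[int, str]]:
    """Reduce a markdown document to its sequence of (level, title) headings."""
    return [(len(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"^(#+)\s+(.*)$", markdown, re.MULTILINE)]

def score_structure(predicted: str, reference: str) -> float:
    """Similarity of heading hierarchies -- a crude proxy for tree similarity."""
    return SequenceMatcher(None, heading_outline(predicted),
                           heading_outline(reference)).ratio()

# A parser can do well on one dimension and badly on the other:
ref = "# Payment Terms\nNet 30 days.\n\n# Delivery Schedule\nShips weekly.\n"
flat = "Payment Terms Net 30 days. Delivery Schedule Ships weekly."
print(score_text(flat, ref))       # high: most characters survive
print(score_structure(flat, ref))  # 0.0: every heading was lost
```

The flattened output keeps most of the characters but loses every heading, which is exactly the failure mode that text-only benchmarks hide.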
---
## The Frontier LLM Surprise
We added frontier LLMs to the benchmark—GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5. The results upend some assumptions.
**GPT-5.1 achieves 92% edit similarity.** That's 14 points above the best open-source parser (78%). First time we've seen >90% in our benchmarks.
**But budget LLMs have a hidden flaw.** GPT-4o-mini scores 75% on text extraction—acceptable at first glance. But only **13% tree similarity**. It extracts text while destroying document structure. For RAG pipelines, it's actively harmful despite reasonable text scores.
**LlamaParse hits the sweet spot.** 78% accuracy at $0.003/page, matching open-source quality at a small fraction of frontier-LLM pricing. Purpose-built for PDF parsing, not a general LLM doing parsing on the side.
| Category | Best Parser | Edit Sim | Cost/Page |
|----------|-------------|----------|-----------|
| Premium LLM | GPT-5.1 | 92% | ~$0.05 |
| Commercial API | LlamaParse | 78% | $0.003 |
| Open Source | pypdfium2 | 78% | Free |
The 14-point premium gap is real, but so is the roughly 17x cost difference between GPT-5.1 and LlamaParse. Match your parser to your document value.
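To put that trade-off in batch terms, here's the arithmetic at 100,000 pages using the per-page figures from the table above (list-price estimates, so treat the totals as rough):

```python
# Rough cost comparison at volume, using the per-page figures quoted above.
PAGES = 100_000

tiers = {
    "GPT-5.1 (premium LLM)":   0.05,   # ~$0.05/page, 92% edit similarity
    "LlamaParse (API)":        0.003,  # $0.003/page, 78% edit similarity
    "pypdfium2 (open source)": 0.0,    # free, 78% edit similarity
}

for name, per_page in tiers.items():
    print(f"{name:<26} ${per_page * PAGES:>8,.0f} for {PAGES:,} pages")
# GPT-5.1: $5,000   LlamaParse: $300   pypdfium2: $0
```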
---
## The Speed Dimension
Speed varies 35,000x across parsers.
| Tier | Parsers | Median Time |
|------|---------|-------------|
| Very Fast | pymupdf, pypdfium2 | <1ms |
| Medium | docling-fast | ~750ms |
| Very Slow | marker | ~14.7s |
For batch processing of 100,000 documents:
- pymupdf: ~2 minutes
- marker: ~17 days
Both might achieve similar accuracy. The question is whether you can afford the speed trade-off. Some use cases demand real-time parsing. Others can batch overnight. Speed constraints narrow parser choices as much as accuracy requirements.
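Before committing, it's worth timing a candidate on a sample of your own documents and extrapolating. Here's a minimal sketch using pymupdf (imported as `fitz`); the `sample_docs/` directory is a placeholder, and `parse_one` is where you'd swap in whichever parser you're evaluating.

```python
# Time a parser on a small sample, then extrapolate to the full batch.
# pymupdf (imported as `fitz`) is used as the example parser.
import statistics
import time
from pathlib import Path

import fitz  # pymupdf

def parse_one(path: Path) -> str:
    """Extract plain text from every page of one PDF."""
    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)

def median_seconds(sample_dir: str, limit: int = 50) -> float:
    """Median wall-clock parse time per document over a small sample."""
    times = []
    for path in sorted(Path(sample_dir).glob("*.pdf"))[:limit]:
        start = time.perf_counter()
        parse_one(path)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

if __name__ == "__main__":
    per_doc = median_seconds("sample_docs/")  # placeholder directory
    print(f"median: {per_doc * 1000:.1f} ms/doc "
          f"-> ~{100_000 * per_doc / 3600:.1f} h for 100k documents")
```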
---
## What This Means Practically
**Stop asking "which parser is best?"** Start asking: What domains are my documents? What accuracy do I need? What latency budget do I have? Do I need OCR? Tables?
**Test on your corpus.** Vendor benchmarks use favorable test sets. Academic benchmarks use synthetic documents. Neither predicts performance on your specific documents. Run your documents through candidates.
**Consider domain-specific routing.** If you process mixed document types, you may need different parsers for different classes. Contracts route to one parser. Invoices route to another with specialized extraction logic (a minimal routing sketch follows below).
**Plan for limitations.** Invoices, complex forms, handwritten content remain hard. Build human-in-the-loop processes for these categories. The goal isn't eliminating human review—it's focusing it where it matters.
**Evaluate structure separately.** If you're building AI pipelines, don't rely on text accuracy alone. Test structure recovery explicitly. Text and structure scores are largely independent—you can't infer one from the other.
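To make the routing idea concrete, here's a minimal dispatch sketch. The classifier rules and the parser assignments are placeholders, not benchmark recommendations; wire in whatever your own corpus testing selects for each document class.

```python
# A thin routing layer: classify each document, then dispatch to the parser
# chosen for its class. Classifier and parser choices below are placeholders.
from typing import Callable

ParseFn = Callable[[str], str]  # path in, extracted text/markdown out

def classify(path: str) -> str:
    """Toy classifier: filename rules. Real pipelines might use a small model."""
    name = path.lower()
    if "invoice" in name:
        return "invoice"
    if "contract" in name or "agreement" in name:
        return "contract"
    return "default"

def parse_with_fast_parser(path: str) -> str:
    raise NotImplementedError("plug in e.g. pymupdf text extraction here")

def parse_with_heavy_parser(path: str) -> str:
    raise NotImplementedError("plug in e.g. a LlamaParse or LLM call here")

ROUTES: dict[str, ParseFn] = {
    "contract": parse_with_fast_parser,  # well-formatted, almost anything works
    "invoice": parse_with_heavy_parser,  # hard domain, spend more per page
    "default": parse_with_fast_parser,
}

def parse(path: str) -> str:
    """Route a document to the parser selected for its class."""
    return ROUTES[classify(path)](path)
```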
---
## Decision Framework
Quick reference for parser selection:
| Use Case | Primary Parser | Why |
|----------|---------------|-----|
| Max quality, low volume | GPT-5.1 | 92% accuracy, ~$0.05/page |
| Quality + cost balance | LlamaParse | 78% at $0.003/page |
| High-volume, speed-critical | pymupdf | <1ms, 77% accuracy |
| RAG pipelines | LlamaParse or docling | Best structure preservation |
| Table extraction | pdfplumber | 93% TEDS score |
| Scanned documents | Frontier LLMs or marker | Built-in OCR |
| Mixed/unknown | pypdfium2 | Best free option |
**Avoid GPT-4o-mini for RAG** despite its reasonable text scores—13% tree similarity will fragment your documents.
But treat this as a starting point, not an answer. Domain determines everything.
---
## The Benchmark Data
We're publishing the full PDFbench results. The corpus breakdown, parser-by-domain matrix, metric definitions, and methodology are available for scrutiny.
Not because we think we've solved parser selection—but because we think transparent benchmarks on real documents are more useful than vendor marketing.
The 45-point gap is real. Domain determines performance. Build your strategy accordingly.
---
*Building document pipelines? Reply with your use case—we're curious what document types give you the most trouble.*