PDFbench

MIT License

Rigorous benchmarks for 27 PDF parsers—including frontier LLMs—across 800+ real-world documents

[Chart: PDF Parser Trade-offs: Speed vs Accuracy (v2, full data)]

We tested 27 parsers—from sub-millisecond open source tools to frontier LLMs like GPT-5.1 and Gemini 3 Pro—on 800+ real-world documents. Six metrics were measured independently to capture different aspects of parsing quality.

Document Domains Tested

  • Legal contracts — SEC filings from the CUAD dataset, dense text with numbered clauses
  • Business invoices — Tables, line items, tax calculations across varying complexity
  • Legal templates — Employment agreements, NDAs, service contracts with hierarchical numbering
  • HR documents — Resumes and CVs with section detection and varied layouts
  • Academic papers — Two-column layouts, equations, mixed content from OmniDocBench
  • Synthetic test cases — Controlled documents for systematic capability testing

Metrics Explained

  • Edit Similarity — Overall text accuracy (primary metric). 92% means 92% of characters match; see the sketch after this list.
  • Tree Similarity — How well document structure (headings, sections) is preserved
  • TEDS — Table extraction quality. Critical for invoices and financial documents.
  • ChrF++ — Character/word n-gram matching. More forgiving of word order variations.
  • CER — Character Error Rate. Lower is better. Standard OCR evaluation metric.
  • Element F1 — Are all document elements (headings, lists, tables) correctly identified?
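
To make the two character-level metrics concrete, here is a minimal sketch of Edit Similarity and CER in terms of Levenshtein distance. These are the standard textbook definitions; the benchmark's actual implementation (normalization, whitespace handling, and so on) may differ.

# Illustrative only: standard definitions of Edit Similarity and CER.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def edit_similarity(extracted: str, reference: str) -> float:
    """1.0 is a perfect character-level match."""
    if not extracted and not reference:
        return 1.0
    dist = levenshtein(extracted, reference)
    return 1.0 - dist / max(len(extracted), len(reference))

def cer(extracted: str, reference: str) -> float:
    """Character Error Rate: edit distance per reference character.
    Lower is better; 0.0 is a perfect transcription."""
    return levenshtein(extracted, reference) / max(len(reference), 1)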

Key finding: GPT-5.1 leads at 92% accuracy, but open source (pypdfium2 at 80.6%) runs 10,000x faster at zero cost. The right choice depends on your volume, budget, and quality requirements.
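
For the zero-cost path mentioned above, text extraction with pypdfium2 takes only a few lines. This is a minimal sketch against pypdfium2's public API; "sample.pdf" is a placeholder path.

# Fast, zero-cost text extraction with pypdfium2.
# "sample.pdf" is a placeholder; swap in your own document.
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("sample.pdf")
pages = []
for i in range(len(pdf)):
    textpage = pdf[i].get_textpage()
    pages.append(textpage.get_text_range())  # full text of page i
print("\n\n".join(pages))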

Key Findings

  • Top 5 parsers cluster within 1 percentage point (80.3-80.6% accuracy) — the choice barely matters for text extraction
  • Speed varies by roughly 20,000x: pymupdf (0.7ms) vs marker (14.7s) — pick based on your latency budget
  • Domain matters more than parser: legal contracts hit 95%+, invoices struggle at 50% everywhere
  • Text accuracy ≠ structure quality: a parser can nail text (80%) but destroy hierarchy (40%)
  • pdfplumber dominates table extraction at 93.4% TEDS — use it for financial documents (see the sketch below)
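
Acting on that last finding is straightforward: here is a minimal table-extraction sketch using pdfplumber's documented API. "invoice.pdf" is a placeholder path.

# Table extraction with pdfplumber, the top TEDS scorer in this benchmark.
# "invoice.pdf" is a placeholder; use your own document.
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns a list of tables,
        # each a list of rows, each row a list of cell strings.
        for table in page.extract_tables():
            print(f"Table on page {page_number}:")
            for row in table:
                print(row)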

The Benchmark

27 parsers tested on 800+ documents across six domains. Six metrics were measured independently, covering text accuracy, structure recovery, and table extraction; parsing speed was tracked alongside.

Full methodology, raw data exports, and reproducible evaluation scripts in the repository.

# Clone and run the benchmark yourself
git clone https://github.com/applied-artificial-intelligence/pdf-parser-benchmark
cd pdf-parser-benchmark
uv run python -m pdfbench evaluate --parser pymupdf --corpus cuad

Questions about this project? Open an issue on GitHub or contact us directly.