PDFsmith

MIT License · View on GitHub · pip install pdfsmith

Unified Python API for 19+ PDF parsing backends including frontier LLMs

Choosing a PDF parser shouldn't require a PhD in document processing. We built PDFsmith after spending months benchmarking 19 different parsers and realizing that no single tool works best for every document type. The core insight: scanned invoices need OCR, complex tables need pdfplumber, and some layouts only frontier LLMs can handle. PDFsmith lets you define routing rules or use our benchmark-informed defaults, so each document is sent to the right backend automatically.

What you get:

  • A single API that works with pymupdf, pdfplumber, pypdfium2, marker, docling, Claude, GPT-4o, and 12 more backends
  • Smart routing based on document characteristics, with no manual backend selection required
  • Modular installation: the core package has zero heavy dependencies; add backends as needed
  • Consistent output format regardless of which backend processes the document

Built for teams who process diverse document types and don't want to maintain separate pipelines for each.
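
To make the routing idea above concrete, here is a minimal sketch of rule-based backend selection. It is not PDFsmith's actual routing code or rule syntax: `DocumentTraits` and `choose_backend` are hypothetical names, the rules are illustrative rather than taken from the benchmark, and the backend strings are assumed to match the names PDFsmith accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentTraits:
    """Hypothetical summary of a PDF; not a PDFsmith type."""
    has_text_layer: bool   # False for scanned, image-only pages
    table_count: int       # tables found by some cheap pre-pass
    complex_layout: bool   # e.g. multi-column pages or dense figures


def choose_backend(traits: DocumentTraits) -> str:
    """Return a backend name using ordered rules; the first match wins."""
    if not traits.has_text_layer:
        return "marker"        # assumption: an OCR-capable backend for scans
    if traits.complex_layout:
        return "anthropic"     # frontier LLM for layouts other parsers miss
    if traits.table_count > 0:
        return "pdfplumber"    # table-heavy documents
    return "pymupdf"           # plain documents with a text layer


scanned_invoice = DocumentTraits(has_text_layer=False, table_count=2, complex_layout=False)
print(choose_backend(scanned_invoice))  # prints "marker"
```

Ordering the rules from most to least restrictive keeps the cheapest backend as the default and reserves the LLM path for documents that seem to need it.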

Why PDFsmith?

  • One API for 19+ backends — switch parsers without changing code
  • Benchmark-informed defaults — auto-selects based on pdf-bench findings
  • Frontier LLM support — Claude, GPT-4o, Gemini for challenging documents
  • Modular installation — pip install only what you need
  • Production ready — consistent error handling, unified output format
from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use specific backend
markdown = parse("document.pdf", backend="docling")

# Use frontier LLM for complex documents
markdown = parse("document.pdf", backend="anthropic")
One API, any backend
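
Because the backend is just a keyword argument, one possible pattern is to try cheaper backends first and fall back to heavier ones. The sketch below assumes only the `parse(path, backend=...)` call shown above. `parse_with_fallback` is a hypothetical helper, not part of PDFsmith, and it catches exceptions broadly because the library's exception types are not documented here.

```python
from typing import Callable, Sequence


def parse_with_fallback(
    path: str,
    backends: Sequence[str],
    parse_fn: Callable[..., str],
) -> str:
    """Try each backend in order and return the first non-empty result."""
    errors = []
    for name in backends:
        try:
            text = parse_fn(path, backend=name)
        except Exception as exc:  # broad on purpose: exception types are unknown
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Possible usage, assuming pdfsmith and the named backends are installed:
# from pdfsmith import parse
# markdown = parse_with_fallback("document.pdf", ["pymupdf", "docling", "anthropic"], parse)
```

Passing the parse function in as an argument keeps the helper testable without any backend installed.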
pip install pdfsmith                   # Core only
pip install "pdfsmith[light]"          # pypdf, pdfplumber, pymupdf
pip install "pdfsmith[recommended]"    # Balanced stack
pip install "pdfsmith[frontier]"       # Claude, GPT-4o, Gemini
pip install "pdfsmith[all]"            # Everything
Install what you need
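
With modular installation, it can help to check which optional backends are importable in the current environment. The sketch below uses only the standard library. The `BACKEND_MODULES` mapping and the `installed_backends` helper are assumptions for illustration, not PDFsmith's own discovery mechanism.

```python
from importlib.util import find_spec

# Assumed mapping from backend name to the top-level module it imports.
# PyMuPDF is traditionally imported as "fitz".
BACKEND_MODULES = {
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pymupdf": "fitz",
    "docling": "docling",
}


def installed_backends(modules=BACKEND_MODULES):
    """Return the backend names whose module can be found, without importing it."""
    return [name for name, module in modules.items() if find_spec(module) is not None]


print(installed_backends())
```

`find_spec` only locates a module, so the check stays fast even when a backend pulls in heavy dependencies.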

Questions about this project? Open an issue on GitHub or contact us directly.