Best Open-Source PDF-to-Markdown Tools in 2026: Marker vs Docling vs MinerU vs pdf-craft vs PyMuPDF4LLM
You're building a RAG pipeline, processing research papers, or digitizing a bookshelf, and you need PDFs turned into clean Markdown. The good news: open-source tooling for this has gotten very good. The bad news: there are now enough options that picking the right one takes real research.
I did that research. Here are the five tools worth knowing about in 2026, what each does best, and which one you should actually use.
The Contenders
1. Marker (datalab-to/marker) ⭐ 19K+
The all-rounder. Marker converts PDFs, images, DOCX, PPTX, XLSX, HTML, and EPUB to Markdown, JSON, or chunks. It uses Surya OCR for text recognition and supports GPU, CPU, and Apple MPS.
What sets Marker apart is the optional --use_llm flag, which layers an LLM on top for accuracy-critical documents. Without it, you get fast local conversion. With it, you trade extra latency and API cost for noticeably better output on messy layouts.
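To make the two modes concrete, here is a minimal sketch that builds the command line for each. It assumes Marker's `marker_single` entry point and its `--output_dir` option as described in the project README; the helper name `build_marker_command` and the file names are hypothetical, so check `marker_single --help` against your installed version.

```python
import shlex
from typing import List


def build_marker_command(pdf_path: str, output_dir: str, use_llm: bool = False) -> List[str]:
    """Return the argv list for a single-file Marker conversion.

    Assumption: the `marker_single` CLI and its `--output_dir` / `--use_llm`
    options exist in your installed Marker version.
    """
    command = ["marker_single", pdf_path, "--output_dir", output_dir]
    if use_llm:
        # LLM mode calls an LLM service, which adds latency and may add API cost.
        command.append("--use_llm")
    return command


# Fast, local conversion.
local_cmd = build_marker_command("paper.pdf", "out")
# Accuracy-critical conversion with the LLM layer enabled.
llm_cmd = build_marker_command("messy layout.pdf", "out", use_llm=True)

# shlex.join quotes arguments so the printed line is safe to paste into a shell.
print(shlex.join(local_cmd))
print(shlex.join(llm_cmd))
```

Either list can be passed to `subprocess.run(cmd, check=True)` to run the conversion.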
Best for: General-purpose conversion, multi-format pipelines, teams that want one tool for everything.
Strengths:
- Widest format support (not just PDFs)
- Optional LLM boost for higher accuracy
- Built-in benchmarking suite
- Structured JSON extraction with schema support
- ~25 pages/sec throughput on H100 in batch mode
Trade-offs:
- LLM mode adds latency and API cost
- Surya OCR is good but not best-in-class for every language
2. Docling (docling-project/docling) ⭐ 20K+
The enterprise pick. Built by IBM Research, Docling is designed for production RAG pipelines. It outputs a structured DoclingDocument that preserves semantic hierarchy: not just text, but the meaning of the document structure.
Docling handles PDF, DOCX, PPTX, XLSX, HTML, images, audio (WAV/MP3), LaTeX, and plain text. It has first-class integrations with LlamaIndex, LangChain, and other gen AI frameworks.
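A minimal conversion sketch, assuming Docling's documented `DocumentConverter` API (`convert()` followed by `export_to_markdown()` on the resulting document). The wrapper function `docling_to_markdown` and its optional `converter` parameter are illustrative additions, not part of Docling; the import is deferred so the function can be defined without Docling installed.

```python
def docling_to_markdown(source, converter=None):
    """Convert a file path or URL to Markdown via Docling.

    `converter` is a hypothetical injection point: pass your own configured
    DocumentConverter (or a stub in tests); otherwise a default one is built.
    """
    if converter is None:
        # Deferred import: requires `pip install docling`.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    # result.document is the structured DoclingDocument; Markdown is one export of it.
    return result.document.export_to_markdown()
```

Calling `docling_to_markdown("report.pdf")` with Docling installed should return the document as a Markdown string; the first call may take a while because models are loaded on demand.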
Best for: Enterprise document processing, structured extraction, teams already in the LlamaIndex/LangChain ecosystem.
Strengths:
- Rich structured output (DoclingDocument format)
- Advanced layout analysis and reading order detection
- Native integrations with RAG frameworks
- Handles audio and image formats too
- Linux Foundation AI project (strong governance)
Trade-offs:
- Heavier setup than simpler tools
- CPU mode is notably slower
- Formula support is less mature than Marker's
3. MinerU (opendatalab/MinerU) ⭐ 30K+
The heavy hitter. MinerU has the most GitHub stars for a reason: it's extremely capable with complex document layouts, especially CJK content. Built by OpenDataLab (Shanghai AI Lab), it uses PaddleOCR and custom layout detection models.
MinerU outputs both Markdown and JSON, with explicit support for agentic workflows. It also has the broadest hardware support of any tool here, running on NVIDIA, AMD, and a dozen Chinese accelerator platforms (Ascend, Kunlunxin, Cambricon, etc.).
Best for: CJK documents, complex academic papers, teams with diverse GPU hardware.
Strengths:
- Best layout detection for complex documents
- Excellent CJK language support
- Runs on practically any accelerator
- Active development (v2.7.6 as of Feb 2026)
- LLM-ready JSON output
Trade-offs:
- Heavier dependencies (PaddlePaddle ecosystem)
- Setup can be complex on non-standard hardware
- Primarily focused on PDFs (not multi-format)
4. pdf-craft (oomol-lab/pdf-craft) ⭐ 4.6K
The book specialist. pdf-craft is purpose-built for converting scanned books to Markdown or EPUB. It uses DeepSeek OCR and runs entirely locally: no network requests, no API keys, no cloud dependency.
Starting with v1.0, pdf-craft dropped LLM-based text correction entirely in favor of pure OCR. The result is faster, more predictable conversion. It automatically generates EPUB table of contents and handles footnotes, formulas, and tables.
Best for: Digitizing physical books, library/archive projects, anyone who wants fully offline conversion.
Strengths:
- Purpose-built for scanned books (headers/footers auto-filtered)
- EPUB output with auto-generated TOC
- Fully local: zero network calls
- DeepSeek OCR handles tables and formulas well
- Clean, focused codebase
Trade-offs:
- PDF-only (no DOCX, PPTX, etc.)
- Requires GPU for practical use
- Smaller community than Marker/Docling/MinerU
- No LLM correction option (removed in v1.0)
5. PyMuPDF4LLM (pymupdf/pymupdf4llm) ⭐ 2K+
The lightweight option. PyMuPDF4LLM is a thin extension of PyMuPDF that extracts PDF content as LLM-ready Markdown. No ML models, no GPU, no heavy dependencies. Just pip install pymupdf4llm and go.
It detects headers, paragraphs, tables, and images using PyMuPDF's layout engine. For native (non-scanned) PDFs with embedded text, it's by far the fastest option.
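A minimal sketch of that workflow, assuming the `pymupdf4llm.to_markdown()` function from the package's documentation. The `looks_scanned` helper and its 200-character threshold are hypothetical: a rough heuristic for spotting PDFs with little or no embedded text, where an OCR-based tool is needed instead.

```python
def native_pdf_to_markdown(pdf_path: str) -> str:
    """Extract Markdown from a PDF that has embedded text."""
    # Deferred import: requires `pip install pymupdf4llm`.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


def looks_scanned(markdown_text: str, page_count: int, min_chars_per_page: int = 200) -> bool:
    """Heuristic (an assumption, tune it for your corpus): very little
    extracted text per page suggests image-only pages that need OCR."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count non-whitespace characters only, so blank lines do not inflate the total.
    visible_chars = len("".join(markdown_text.split()))
    return visible_chars / page_count < min_chars_per_page
```

If `looks_scanned` returns `True` for a document, that is a signal to route it to one of the OCR-capable tools rather than trusting the near-empty output.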
Best for: Native PDFs, quick extraction, CPU-only environments, preprocessing before feeding to an LLM.
Strengths:
- Fastest conversion (no ML overhead)
- Minimal dependencies
- Works everywhere: CPU only, no GPU needed
- Good table detection for structured PDFs
- Trivial to integrate into any Python pipeline
Trade-offs:
- No OCR, so it is useless for scanned documents
- Accuracy depends heavily on PDF structure quality
- No formula recognition
- Less sophisticated layout analysis
Head-to-Head Comparison
| Feature | Marker | Docling | MinerU | pdf-craft | PyMuPDF4LLM |
|---|---|---|---|---|---|
| GitHub Stars | 19K+ | 20K+ | 30K+ | 4.6K | 2K+ |
| OCR Engine | Surya | Custom | PaddleOCR | DeepSeek | None |
| GPU Required | No (optional) | No (optional) | Recommended | Yes | No |
| Scanned PDFs | ✓ | ✓ | ✓ | ✓ (specialized) | ✗ |
| Multi-format | PDF, DOCX, PPTX, EPUB, images | PDF, DOCX, PPTX, HTML, audio | Primarily PDF | PDF only | PDF |
| Tables | ✓✓ | ✓✓ | ✓✓✓ | ✓ | ✓ |
| Formulas | ✓✓ | ✓ | ✓✓ | ✓✓ | ✗ |
| EPUB Output | ✗ | ✗ | ✗ | ✓ | ✗ |
| LLM Boost | Optional | ✗ | ✗ | ✗ (removed) | ✗ |
| Fully Local | ✓ | ✓ | ✓ | ✓ | ✓ |
| CJK Support | Good | Good | Excellent | Good | Basic |
| RAG Integration | JSON/chunks | LlamaIndex, LangChain | JSON | None | Basic |
| Speed (native PDF) | Fast | Moderate | Moderate | N/A | Fastest |
| Speed (scanned) | Fast (GPU) | Moderate | Fast (GPU) | Fast (GPU) | N/A |
Decision Tree
Start here:
- Is your PDF native (has selectable text)? → Try PyMuPDF4LLM first. It's fast, light, and often good enough. If the output quality isn't there, move to Marker.
- Are you processing scanned books? → pdf-craft is built for exactly this. If you also need multi-format support, use Marker instead.
- Do you need structured output for a RAG pipeline? → Docling if you're in the LlamaIndex/LangChain ecosystem. Marker if you want JSON schema extraction.
- Complex academic papers with CJK text? → MinerU. Nothing else comes close for Chinese/Japanese/Korean layout detection.
- Want one tool that does everything reasonably well? → Marker. It's the Swiss Army knife of this category.
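The same decision tree can be sketched as a small function. The function name, the parameter names, and the choice to check the questions strictly in the order listed are illustrative assumptions; the recommendations themselves come from the list above.

```python
def pick_tool(
    *,
    native_text: bool = False,           # PDF has selectable text
    scanned_book: bool = False,
    needs_multi_format: bool = False,    # also DOCX, PPTX, and so on
    rag_structured_output: bool = False,
    uses_llamaindex_or_langchain: bool = False,
    complex_cjk: bool = False,
) -> str:
    """Walk the questions in the order the article lists them."""
    if native_text:
        # Fall back to Marker if the output quality is not good enough.
        return "PyMuPDF4LLM"
    if scanned_book:
        return "Marker" if needs_multi_format else "pdf-craft"
    if rag_structured_output:
        return "Docling" if uses_llamaindex_or_langchain else "Marker"
    if complex_cjk:
        return "MinerU"
    # General-purpose default.
    return "Marker"


print(pick_tool(scanned_book=True))
print(pick_tool(rag_structured_output=True, uses_llamaindex_or_langchain=True))
```

Because the first matching question wins, a job that fits several categories (say, a native PDF with dense CJK layout) gets the earliest answer; reorder the checks if your priorities differ.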
The Bottom Line
The PDF-to-Markdown space has matured fast. Two years ago you were choosing between "bad" and "less bad." Now you're choosing between genuinely good tools with different specializations.
If you're only going to install one tool, Marker is the safest default. If you have a specific use case, such as books (pdf-craft), enterprise RAG (Docling), CJK (MinerU), or quick-and-dirty native PDF extraction (PyMuPDF4LLM), pick the specialist.
All five are open source, all run locally, and all are actively maintained. The real winner is the ecosystem.
Have a favorite PDF-to-Markdown tool I missed? Let me know.