What is the best open-source PDF-to-Markdown converter in 2026?

It depends on your use case. Marker offers the best balance of speed and accuracy, with an optional LLM boost. Docling excels at enterprise pipelines with structured output. MinerU handles CJK and complex layouts well. pdf-craft is purpose-built for scanned books. PyMuPDF4LLM is the fastest for native PDFs that don't need OCR.

Which PDF-to-Markdown tool is best for RAG pipelines?

Docling and Marker are the top choices for RAG. Docling outputs a structured DoclingDocument that integrates with LlamaIndex and LangChain. Marker outputs clean Markdown and JSON with optional chunking.

Can I convert scanned PDFs to Markdown without a cloud API?

Yes. pdf-craft, MinerU, and Marker all run fully locally with GPU acceleration. pdf-craft uses DeepSeek OCR, MinerU uses PaddleOCR, and Marker uses Surya OCR.

Do I need a GPU to convert PDFs to Markdown?

Not always. PyMuPDF4LLM works on CPU with no ML models. Marker supports CPU, GPU, and Apple MPS. Docling can run on CPU but is slower. MinerU and pdf-craft strongly benefit from GPU for OCR.

Which tool handles tables and formulas best?

Marker and MinerU lead on table extraction. For LaTeX formulas, Marker and pdf-craft both handle them well. Docling's table structure recognition is strong but formula support is more limited.

What is pdf-craft and how does it compare to Marker?

pdf-craft is a newer tool from oomol-lab that uses DeepSeek OCR for fully local PDF conversion, specializing in scanned books. Marker is more general-purpose, supports more input formats (DOCX, PPTX, EPUB), and has an optional LLM accuracy boost.

Can these tools handle multi-language PDFs?

Yes. Marker supports all languages via Surya OCR. MinerU has strong CJK support via PaddleOCR. Docling handles multiple languages. pdf-craft inherits DeepSeek OCR's multilingual capabilities.

Which PDF converter is fastest?

PyMuPDF4LLM is the fastest for native (non-scanned) PDFs since it uses no ML models. For scanned PDFs, Marker's projected throughput is about 25 pages/second on an H100 in batch mode. MinerU and pdf-craft are comparable when run on a GPU.

Best Open-Source PDF-to-Markdown Tools in 2026: Marker vs Docling vs MinerU vs pdf-craft vs PyMuPDF4LLM

By Prahlad Menon. Published 2026-03-26.

You’re building a RAG pipeline, processing research papers, or digitizing a bookshelf — and you need PDFs turned into clean Markdown. The good news: open-source tooling for this has gotten very good. The bad news: there are now enough options that picking the right one takes real research.

I did that research. Here are the five tools worth knowing about in 2026, what each does best, and which one you should actually use.

The Contenders

1. Marker (datalab-to/marker) — ⭐ 19K+

The all-rounder. Marker converts PDFs, images, DOCX, PPTX, XLSX, HTML, and EPUB to Markdown, JSON, or chunks. It uses Surya OCR for text recognition and supports GPU, CPU, and Apple MPS.

What sets Marker apart is the optional --use_llm flag, which layers an LLM on top of the local pipeline for accuracy-critical documents. Without it, you get fast local conversion. With it, you trade some latency and API cost for noticeably better output on messy layouts.
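To make the two modes concrete, here is a minimal sketch that builds the command line for Marker's single-file CLI with and without the LLM flag. It assumes the `marker_single` entry point and the `--output_format`, `--output_dir`, and `--use_llm` options as described in Marker's README; the helper names `marker_command` and `convert` and the file paths are made up for illustration, so check `marker_single --help` against your installed version.

```python
import shlex
import subprocess
from typing import List


def marker_command(pdf_path: str, output_dir: str, use_llm: bool = False) -> List[str]:
    """Build the argv list for Marker's single-file CLI (hypothetical helper)."""
    cmd = [
        "marker_single", pdf_path,
        "--output_format", "markdown",
        "--output_dir", output_dir,
    ]
    if use_llm:
        # LLM mode needs credentials for an LLM service and adds latency and cost.
        cmd.append("--use_llm")
    return cmd


def convert(pdf_path: str, output_dir: str, use_llm: bool = False) -> None:
    """Run Marker; raises CalledProcessError if the conversion fails."""
    subprocess.run(marker_command(pdf_path, output_dir, use_llm), check=True)


if __name__ == "__main__":
    # Print both variants instead of running them, so this is safe anywhere.
    print(shlex.join(marker_command("paper.pdf", "out")))
    print(shlex.join(marker_command("paper.pdf", "out", use_llm=True)))
```

A reasonable workflow is to run the plain command first and re-run only the documents whose output looks wrong with `use_llm=True`.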

Best for: General-purpose conversion, multi-format pipelines, teams that want one tool for everything.

Strengths:

Widest format support (not just PDFs)
Optional LLM boost for higher accuracy
Built-in benchmarking suite
Structured JSON extraction with schema support
~25 pages/sec throughput on H100 in batch mode

Trade-offs:

LLM mode adds latency and API cost
Surya OCR is good but not best-in-class for every language

2. Docling (docling-project/docling) — ⭐ 20K+

The enterprise pick. Built by IBM Research, Docling is designed for production RAG pipelines. It outputs a structured DoclingDocument that preserves semantic hierarchy — not just text, but the meaning of the document structure.

Docling handles PDF, DOCX, PPTX, XLSX, HTML, images, audio (WAV/MP3), LaTeX, and plain text. It has first-class integrations with LlamaIndex, LangChain, and other gen AI frameworks.
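As a rough illustration of the basic flow, the sketch below converts one source with Docling and exports the resulting DoclingDocument to Markdown. It assumes the `DocumentConverter` class and `export_to_markdown()` method from Docling's documented quick-start; the helpers `docling_to_markdown` and `markdown_path_for` are hypothetical names, and the import is deferred so the file loads even where Docling is not installed.

```python
from pathlib import Path


def docling_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown via Docling."""
    # Lazy import: requires `pip install docling` at call time only.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    # result.document is the structured DoclingDocument; Markdown is one export.
    return result.document.export_to_markdown()


def markdown_path_for(source: str, out_dir: str) -> Path:
    """Map an input file name to an output .md path (hypothetical helper)."""
    return Path(out_dir) / (Path(source).stem + ".md")


if __name__ == "__main__":
    src = "report.pdf"  # placeholder path; replace with a real document
    print("would write:", markdown_path_for(src, "markdown_out"))
```

For RAG work you would typically keep the DoclingDocument itself rather than only the Markdown string, since the structure is what the framework integrations consume.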

Best for: Enterprise document processing, structured extraction, teams already in the LlamaIndex/LangChain ecosystem.

Strengths:

Rich structured output (DoclingDocument format)
Advanced layout analysis and reading order detection
Native integrations with RAG frameworks
Handles audio and image formats too
Linux Foundation AI project — strong governance

Trade-offs:

Heavier setup than simpler tools
CPU mode is notably slower
Formula support less mature than Marker

3. MinerU (opendatalab/MinerU) — ⭐ 30K+

The heavy hitter. MinerU has the most GitHub stars for a reason — it’s extremely capable at complex document layouts, especially CJK content. Built by OpenDataLab (Shanghai AI Lab), it uses PaddleOCR and custom layout detection models.

MinerU outputs both Markdown and JSON, with explicit support for agentic workflows. It also has the broadest hardware support of any tool here, running on NVIDIA, AMD, and a dozen Chinese accelerator platforms (Ascend, Kunlunxin, Cambricon, etc.).

Best for: CJK documents, complex academic papers, teams with diverse GPU hardware.

Strengths:

Best layout detection for complex documents
Excellent CJK language support
Runs on practically any accelerator
Active development (v2.7.6 as of Feb 2026)
LLM-ready JSON output

Trade-offs:

Heavier dependencies (PaddlePaddle ecosystem)
Setup can be complex on non-standard hardware
Primarily focused on PDFs (not multi-format)

4. pdf-craft (oomol-lab/pdf-craft) — ⭐ 4.6K

The book specialist. pdf-craft is purpose-built for converting scanned books to Markdown or EPUB. It uses DeepSeek OCR and runs entirely locally — no network requests, no API keys, no cloud dependency.

Starting with v1.0, pdf-craft dropped LLM-based text correction entirely in favor of pure OCR. The result is faster, more predictable conversion. It automatically generates EPUB table of contents and handles footnotes, formulas, and tables.

Best for: Digitizing physical books, library/archive projects, anyone who wants fully offline conversion.

Strengths:

Purpose-built for scanned books (headers/footers auto-filtered)
EPUB output with auto-generated TOC
Fully local — zero network calls
DeepSeek OCR handles tables and formulas well
Clean, focused codebase

Trade-offs:

PDF-only (no DOCX, PPTX, etc.)
Requires GPU for practical use
Smaller community than Marker/Docling/MinerU
No LLM correction option (removed in v1.0)

5. PyMuPDF4LLM (pymupdf/pymupdf4llm) — ⭐ 2K+

The lightweight option. PyMuPDF4LLM is a thin extension of PyMuPDF that extracts PDF content as LLM-ready Markdown. No ML models, no GPU, no heavy dependencies. Just pip install pymupdf4llm and go.

It detects headers, paragraphs, tables, and images using PyMuPDF’s layout engine. For native (non-scanned) PDFs with embedded text, it’s by far the fastest option.
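A minimal sketch of that workflow, assuming the `pymupdf4llm.to_markdown()` function from the library's documentation. The `looks_scanned` check is a hypothetical heuristic (not part of PyMuPDF4LLM) for spotting image-only PDFs that need an OCR-based tool instead, and its 200-characters-per-page threshold is an arbitrary assumption to tune on your own documents.

```python
def extract_markdown(pdf_path: str) -> str:
    """Extract Markdown from a native PDF with PyMuPDF4LLM."""
    # Lazy import: requires `pip install pymupdf4llm` at call time only.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


def looks_scanned(markdown: str, page_count: int, min_chars_per_page: int = 200) -> bool:
    """Hypothetical heuristic: very little text per page suggests image-only pages."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count non-whitespace characters so blank lines do not inflate the total.
    visible = sum(1 for ch in markdown if not ch.isspace())
    return visible / page_count < min_chars_per_page


if __name__ == "__main__":
    sample = "# Title\n\n" + "word " * 400  # stand-in for extracted text
    print("needs OCR tool:", looks_scanned(sample, page_count=2))
```

If the check fires, the extraction probably found little or no embedded text, which is the case where an OCR-based converter such as Marker is the better fit.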

Best for: Native PDFs, quick extraction, CPU-only environments, preprocessing before feeding to an LLM.

Strengths:

Fastest conversion (no ML overhead)
Minimal dependencies
Works everywhere — CPU only, no GPU needed
Good table detection for structured PDFs
Trivial to integrate into any Python pipeline

Trade-offs:

No OCR — useless for scanned documents
Accuracy depends heavily on PDF structure quality
No formula recognition
Less sophisticated layout analysis

Head-to-Head Comparison

Feature	Marker	Docling	MinerU	pdf-craft	PyMuPDF4LLM
GitHub Stars	19K+	20K+	30K+	4.6K	2K+
OCR Engine	Surya	Custom	PaddleOCR	DeepSeek	None
GPU Required	No (optional)	No (optional)	Recommended	Yes	No
Scanned PDFs	✅	✅	✅	✅ (specialized)	❌
Multi-format	PDF, DOCX, PPTX, XLSX, HTML, EPUB, images	PDF, DOCX, PPTX, XLSX, HTML, images, audio, LaTeX	PDF	PDF	PDF
Tables	✅✅	✅✅	✅✅✅	✅	✅
Formulas	✅✅	✅	✅✅	✅✅	❌
EPUB Output	❌	❌	❌	✅	❌
LLM Boost	Optional	❌	❌	❌ (removed)	❌
Fully Local	✅	✅	✅	✅	✅
CJK Support	Good	Good	Excellent	Good	Basic
RAG Integration	JSON/chunks	LlamaIndex, LangChain	JSON	❌	Basic
Speed (native PDF)	Fast	Moderate	Moderate	N/A	Fastest
Speed (scanned)	Fast (GPU)	Moderate	Fast (GPU)	Fast (GPU)	N/A

Decision Tree

Start here:

Is your PDF native (has selectable text)? → Try PyMuPDF4LLM first. It’s fast, light, and often good enough. If the output quality isn’t there, move to Marker.
Are you processing scanned books? → pdf-craft is built for exactly this. If you also need multi-format support, use Marker instead.
Do you need structured output for a RAG pipeline? → Docling if you’re in the LlamaIndex/LangChain ecosystem. Marker if you want JSON schema extraction.
Complex academic papers with CJK text? → MinerU. It has the strongest Chinese/Japanese/Korean layout detection and language support of the five tools here.
Want one tool that does everything reasonably well? → Marker. It’s the Swiss Army knife of this category.
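The questions above can be sketched as a small function. This is only an illustration: the `Job` fields and the `pick_tool` name are invented here, and treating the questions as a first-match-wins sequence in the order listed is an assumption about how to resolve overlapping cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a conversion job."""
    native_text: bool = False        # PDF has selectable, embedded text
    scanned_book: bool = False       # scanned book pages
    multi_format: bool = False       # also need DOCX, PPTX, and similar inputs
    rag_structured: bool = False     # need structured output for a RAG pipeline
    llamaindex_or_langchain: bool = False
    cjk_academic: bool = False       # complex academic papers with CJK text


def pick_tool(job: Job) -> str:
    """Return the first tool suggested by the decision list, checked in order."""
    if job.native_text:
        return "PyMuPDF4LLM"  # move to Marker if the output quality falls short
    if job.scanned_book:
        return "Marker" if job.multi_format else "pdf-craft"
    if job.rag_structured:
        return "Docling" if job.llamaindex_or_langchain else "Marker"
    if job.cjk_academic:
        return "MinerU"
    return "Marker"  # the general-purpose default


if __name__ == "__main__":
    print(pick_tool(Job(scanned_book=True)))
    print(pick_tool(Job(rag_structured=True, llamaindex_or_langchain=True)))
```

Real jobs often match more than one question (a scanned CJK book, for example), so treat the result as a starting point and compare output on a sample of your own documents.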

The Bottom Line

The PDF-to-Markdown space has matured fast. Two years ago you were choosing between “bad” and “less bad.” Now you’re choosing between genuinely good tools with different specializations.

If you’re only going to install one tool: Marker is the safest default. If you have a specific use case — books (pdf-craft), enterprise RAG (Docling), CJK (MinerU), or quick-and-dirty native PDF extraction (PyMuPDF4LLM) — pick the specialist.

All five are open source, all run locally, and all are actively maintained. The real winner is the ecosystem.

Have a favorite PDF-to-Markdown tool I missed? Let me know.