A 0.9B Model Just Beat Gemini at OCR. Here's How GLM-OCR Did It.
A 0.9-billion parameter model just beat Gemini-3 Pro at document OCR. By a meaningful margin. On a public benchmark. While also beating Qwen3-VL with 235 billion parameters — a model 261 times its size.
GLM-OCR (arXiv:2603.10910) from Zhipu AI and Tsinghua University scores 94.62 on OmniDocBench v1.5 — the current leading benchmark for document understanding. Gemini-3 Pro scores 90.33. Qwen3-VL-235B scores 89.15.
This is worth understanding, because it’s not magic. There are specific architectural reasons it works, and they point to a broader principle that applies well beyond OCR.
What GLM-OCR is
Architecture: 0.4B CogViT visual encoder + 0.5B GLM language decoder = 0.9B total
What it does:
- Document parsing — financial reports, scientific papers, contracts, invoices
- Text transcription (8+ languages, 8K resolution)
- LaTeX formula recognition
- Table structure recovery
- Key Information Extraction (KIE)
What it doesn’t do: general vision-language reasoning, image captioning, question answering about arbitrary images. It’s a specialist, not a generalist.
How it beats models 261x its size
Two architecture decisions do most of the work:
1. Multi-Token Prediction (MTP)
Standard autoregressive decoding: predict one token → feed it back → predict the next → repeat.
For open-ended generation (writing, reasoning, code), this makes sense — each token genuinely depends on what came before. But for OCR, the output is almost entirely deterministic. If a document says “Total: $4,821.50”, the model doesn’t need to “decide” what comes next — it just needs to read what’s there.
GLM-OCR uses Multi-Token Prediction — predicting multiple tokens per decoding step using shared parameters. This dramatically increases throughput while keeping memory overhead low. The key insight: when your task is reading rather than generating, the standard decoding loop is wasteful.
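The intuition can be made concrete with a toy sketch. This is not the GLM-OCR implementation — the "model" below is a stand-in that deterministically reads characters from a document, so one call to `read` simulates one decoder forward pass. The point is only the step count: emitting k tokens per pass divides the number of passes by k when the output is deterministic.

```python
# Toy illustration of single-token vs. multi-token decoding for a
# "reading" task. DOC plays the role of the document's ground-truth text.
DOC = list("Total: $4,821.50")

def read(pos: int, k: int = 1) -> list:
    """Simulated forward pass: emit the next k tokens starting at pos."""
    return DOC[pos:pos + k]

def decode(k: int):
    """Decode the whole document k tokens per step.

    Returns the decoded text and the number of forward passes used.
    """
    out, steps = [], 0
    while len(out) < len(DOC):
        out.extend(read(len(out), k))
        steps += 1
    return "".join(out), steps

text1, steps1 = decode(k=1)   # standard autoregressive: one token per pass
text4, steps4 = decode(k=4)   # MTP-style: four tokens per pass
assert text1 == text4 == "Total: $4,821.50"
print(steps1, steps4)  # 16 forward passes vs. 4
```

Both loops produce identical text; only the pass count differs. That is exactly the regime OCR lives in: the answer is fixed by the page, so there is no quality cost to emitting several tokens at once.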
2. Two-stage pipeline with parallel region processing
Rather than running the model end-to-end on a full document image, GLM-OCR uses:
- PP-DocLayout-V3 first — analyzes the layout, identifies regions (text blocks, tables, formulas, figures)
- GLM-OCR then processes each region in parallel
This is the same principle that makes any divide-and-conquer system faster: break the problem into independent pieces, solve them simultaneously. A 20-page document isn’t processed as one giant image — it’s split into structured regions that can be recognized concurrently.
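The second stage of that pipeline can be sketched in a few lines. The region records and the `recognize` stub below are illustrative assumptions, not the actual output format of PP-DocLayout-V3 or the GLM-OCR API; the point is the pattern: once layout analysis has produced independent regions, recognition can run concurrently, with results stitched back together in reading order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical output of a layout-analysis stage: independent regions
# with a reading-order index, a type, and a cropped image (a string here).
regions = [
    {"order": 0, "kind": "text",    "crop": "Q3 Financial Summary"},
    {"order": 1, "kind": "table",   "crop": "Revenue | 4,821.50"},
    {"order": 2, "kind": "formula", "crop": "E = mc^2"},
]

def recognize(region: dict):
    """Stand-in for running the OCR model on one cropped region."""
    return region["order"], f'[{region["kind"]}] {region["crop"]}'

# Regions are independent, so they can be recognized concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(recognize, regions))

# Reassemble in the layout's reading order, regardless of finish order.
document = "\n".join(text for _, text in sorted(results))
print(document)
```

In a real pipeline the workers would be model inference calls (batched on a GPU or fanned out across machines), but the structure is the same: layout first, then embarrassingly parallel recognition, then ordered reassembly.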
The benchmark numbers (with honest caveats)
OmniDocBench v1.5 Overall Score:
| Model | Parameters | Overall Score |
|---|---|---|
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | ~1B | 94.50 |
| MinerU 2.5 | — | 90.67 |
| Gemini-3 Pro | ~100B+ | 90.33 |
| Qwen3-VL-235B | 235B | 89.15 |
Where GLM-OCR doesn’t win:
- MinerU 2.5 leads on PubTabNet (complex table structure recognition)
- Gemini-3 Pro leads on reference-only KIE scores
It’s not a clean sweep. But winning the overall benchmark by more than 4 points over Gemini-3 Pro and by roughly 5.5 points over a 235B model, at 0.9B, is the story.
Why the size gap is the actual headline
Qwen3-VL-235B is a 235-billion parameter vision-language model — one of the most capable general VLMs in existence. GLM-OCR beats it at document OCR while being 261 times smaller.
The reason isn’t that GLM-OCR is smarter. It’s that Qwen3-VL-235B has to be good at everything — visual reasoning, image understanding, multi-turn conversation, code, math. OCR is one task among thousands it was trained to handle.
GLM-OCR only has to be good at one thing. All 0.9 billion of its parameters are pointed at document understanding. The MTP decoder is tuned for deterministic reading. The two-stage pipeline is tuned for document layouts. It doesn’t waste capacity on tasks it will never be asked to do.
This is the task-specialization principle playing out at extreme scale: a purpose-built 0.9B model beats a 235B generalist on the task the 0.9B model was built for.
What this means in practice
Edge deployment: At 0.9B parameters, GLM-OCR runs locally on consumer hardware — no GPU required, no API call, no data leaving your machine. For document processing workflows where privacy matters (healthcare, legal, finance), this is significant.
Production scale: The parallel region processing architecture isn’t just fast on a single document — it scales efficiently to large-scale batch processing without the cost of routing millions of documents through a frontier model API.
Fine-tuning: Zhipu AI provides fine-tuning capabilities, so you can adapt it to domain-specific document formats (medical records, legal contracts, specific invoice templates) without training from scratch.
SDK + MaaS: Available as a local Python SDK (pip install glm-ocr) or as a hosted API if you don’t want to run it locally.
The broader pattern
We’ve been tracking a consistent pattern across AI research in early 2026:
The most capable models for specific tasks are increasingly not the largest general models. They’re purpose-built smaller models with architectural choices tuned to the task.
- V-JEPA 2.1 beats general VLMs on action anticipation and robot grasping — by predicting in abstract representation space rather than generating pixels
- IsoDDE (Isomorphic Labs) beats AlphaFold 3 on drug design — by the same JEPA-style philosophy applied to molecular representations
- GLM-OCR beats Gemini-3 Pro and Qwen3-VL-235B on document OCR — by replacing autoregressive decoding with MTP and adding layout-aware parallelism
The message from all three: the right inductive bias for the task beats raw scale. Frontier models are getting better at everything — but specialized models are getting better at specific things faster, and often with orders-of-magnitude less compute.
For anyone building document processing pipelines, invoice extraction, scientific paper parsing, or any high-volume OCR workflow: GLM-OCR is worth evaluating today. The repo is at github.com/zai-org/GLM-OCR.
Sources: GLM-OCR Technical Report (arXiv:2603.10910) · GitHub: zai-org/GLM-OCR · Zhipu AI / Tsinghua University
Related: LeCun Just Raised $1B to Replace LLMs — V-JEPA 2.1 Update · Unsloth Studio: Fine-Tune 500+ LLMs Without Code