A 0.9B Model Just Beat Gemini at OCR. Here's How GLM-OCR Did It.
A 0.9-billion parameter model just beat Gemini-3 Pro at document OCR. By a meaningful margin. On a public benchmark. While also beating Qwen3-VL with 235 billion parameters — a model 261 times its size.
GLM-OCR (arXiv:2603.10910) from Zhipu AI and Tsinghua University scores 94.62 on OmniDocBench v1.5 — the current leading benchmark for document understanding. Gemini-3 Pro scores 90.33. Qwen3-VL-235B scores 89.15.
This is worth understanding, because it’s not magic. There are specific architectural reasons it works, and they point to a broader principle that applies well beyond OCR.
What GLM-OCR is
Architecture: 0.4B CogViT visual encoder + 0.5B GLM language decoder = 0.9B total
What it does:
- Document parsing — financial reports, scientific papers, contracts, invoices
- Text transcription (8+ languages, 8K resolution)
- LaTeX formula recognition
- Table structure recovery
- Key Information Extraction (KIE)
What it doesn’t do: general vision-language reasoning, image captioning, question answering about arbitrary images. It’s a specialist, not a generalist.
How it beats models 261x its size
Two architecture decisions do most of the work:
1. Multi-Token Prediction (MTP)
Standard autoregressive decoding: predict one token → feed it back → predict the next → repeat.
For open-ended generation (writing, reasoning, code), this makes sense — each token genuinely depends on what came before. But for OCR, the output is almost entirely deterministic. If a document says “Total: $4,821.50”, the model doesn’t need to “decide” what comes next — it just needs to read what’s there.
GLM-OCR uses Multi-Token Prediction — predicting multiple tokens per decoding step using shared parameters. This dramatically increases throughput while keeping memory overhead low. The key insight: when your task is reading rather than generating, the standard decoding loop is wasteful.
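The intuition can be made concrete with a toy sketch. This is not the GLM-OCR implementation — the "model" below is a stand-in that deterministically reads characters from a document, so one call to `read` simulates one decoder forward pass. The point is only the step count: emitting k tokens per pass divides the number of passes by k when the output is deterministic.

```python
# Toy illustration of single-token vs. multi-token decoding for a
# "reading" task. DOC plays the role of the document's ground-truth text.
DOC = list("Total: $4,821.50")

def read(pos: int, k: int = 1) -> list:
    """Simulated forward pass: emit the next k tokens starting at pos."""
    return DOC[pos:pos + k]

def decode(k: int):
    """Decode the whole document k tokens per step.

    Returns the decoded text and the number of forward passes used.
    """
    out, steps = [], 0
    while len(out) < len(DOC):
        out.extend(read(len(out), k))
        steps += 1
    return "".join(out), steps

text1, steps1 = decode(k=1)   # standard autoregressive: one token per pass
text4, steps4 = decode(k=4)   # MTP-style: four tokens per pass
assert text1 == text4 == "Total: $4,821.50"
print(steps1, steps4)  # 16 forward passes vs. 4
```

Both loops produce identical text; only the pass count differs. That is exactly the regime OCR lives in: the answer is fixed by the page, so there is no quality cost to emitting several tokens at once.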
2. Two-stage pipeline with parallel region processing
Rather than running the model end-to-end on a full document image, GLM-OCR uses:
- PP-DocLayout-V3 first — analyzes the layout, identifies regions (text blocks, tables, formulas, figures)
- GLM-OCR then processes each region in parallel
This is the same principle that makes any divide-and-conquer system faster: break the problem into independent pieces, solve them simultaneously. A 20-page document isn’t processed as one giant image — it’s split into structured regions that can be recognized concurrently.
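The second stage of that pipeline can be sketched in a few lines. The region records and the `recognize` stub below are illustrative assumptions, not the actual output format of PP-DocLayout-V3 or the GLM-OCR API; the point is the pattern: once layout analysis has produced independent regions, recognition can run concurrently, with results stitched back together in reading order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical output of a layout-analysis stage: independent regions
# with a reading-order index, a type, and a cropped image (a string here).
regions = [
    {"order": 0, "kind": "text",    "crop": "Q3 Financial Summary"},
    {"order": 1, "kind": "table",   "crop": "Revenue | 4,821.50"},
    {"order": 2, "kind": "formula", "crop": "E = mc^2"},
]

def recognize(region: dict):
    """Stand-in for running the OCR model on one cropped region."""
    return region["order"], f'[{region["kind"]}] {region["crop"]}'

# Regions are independent, so they can be recognized concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(recognize, regions))

# Reassemble in the layout's reading order, regardless of finish order.
document = "\n".join(text for _, text in sorted(results))
print(document)
```

In a real pipeline the workers would be model inference calls (batched on a GPU or fanned out across machines), but the structure is the same: layout first, then embarrassingly parallel recognition, then ordered reassembly.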
The benchmark numbers (with honest caveats)
OmniDocBench v1.5 Overall Score:
| Model | Parameters | Overall Score |
|---|---|---|
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | ~1B | 94.50 |
| MinerU 2.5 | — | 90.67 |
| Gemini-3 Pro | ~100B+ | 90.33 |
| Qwen3-VL-235B | 235B | 89.15 |
Where GLM-OCR doesn’t win:
- MinerU 2.5 leads on PubTabNet (complex table structure recognition)
- Gemini-3 Pro leads on reference-only KIE scores
It’s not a clean sweep. But winning the overall benchmark by more than 4 points over Gemini-3 Pro and by roughly 5.5 points over a 235B model, at 0.9B, is the story.
Why the size gap is the actual headline
Qwen3-VL-235B is a 235-billion parameter vision-language model — one of the most capable general VLMs in existence. GLM-OCR beats it at document OCR while being 261 times smaller.
The reason isn’t that GLM-OCR is smarter. It’s that Qwen3-VL-235B has to be good at everything — visual reasoning, image understanding, multi-turn conversation, code, math. OCR is one task among thousands it was trained to handle.
GLM-OCR only has to be good at one thing. All 0.9 billion of its parameters are pointed at document understanding. The MTP decoder is tuned for deterministic reading. The two-stage pipeline is tuned for document layouts. It doesn’t waste capacity on tasks it will never be asked to do.
This is the task-specialization principle playing out at extreme scale: a purpose-built 0.9B model beats a 235B generalist on the task the 0.9B model was built for.
What this means in practice
Edge deployment: At 0.9B parameters, GLM-OCR runs locally on consumer hardware — no GPU required, no API call, no data leaving your machine. For document processing workflows where privacy matters (healthcare, legal, finance), this is significant.
Production scale: The parallel region processing architecture isn’t just fast on a single document — it scales efficiently to large-scale batch processing without the cost of routing millions of documents through a frontier model API.
Fine-tuning: Zhipu AI provides fine-tuning capabilities, so you can adapt it to domain-specific document formats (medical records, legal contracts, specific invoice templates) without training from scratch.
SDK + MaaS: Available as a local Python SDK (pip install glm-ocr) or as a hosted API if you don’t want to run it locally.
The broader pattern
We’ve been tracking a consistent pattern across AI research in early 2026:
The most capable models for specific tasks are increasingly not the largest general models. They’re purpose-built smaller models with architectural choices tuned to the task.
- V-JEPA 2.1 beats general VLMs on action anticipation and robot grasping — by predicting in abstract representation space rather than generating pixels
- IsoDDE (Isomorphic Labs) beats AlphaFold 3 on drug design — by the same JEPA-style philosophy applied to molecular representations
- GLM-OCR beats Gemini-3 Pro and Qwen3-VL-235B on document OCR — by replacing autoregressive decoding with MTP and adding layout-aware parallelism
The message from all three: the right inductive bias for the task beats raw scale. Frontier models are getting better at everything — but specialized models are getting better at specific things faster, and often with orders-of-magnitude less compute.
For anyone building document processing pipelines, invoice extraction, scientific paper parsing, or any high-volume OCR workflow: GLM-OCR is worth evaluating today. The repo is at github.com/zai-org/GLM-OCR.
Sources: GLM-OCR Technical Report (arXiv:2603.10910) · GitHub: zai-org/GLM-OCR · Zhipu AI / Tsinghua University
Related: LeCun Just Raised $1B to Replace LLMs — V-JEPA 2.1 Update · Unsloth Studio: Fine-Tune 500+ LLMs Without Code