Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models
Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models
TL;DR: We validate that open-weight vision models (Qwen3-VL, Nemotron) can extract chart data with 99-100% accuracy, enabling a 30Γ cost reduction compared to Claude while supporting fully offline/air-gapped deployments.
Abstract
Extracting structured data from chart images is a critical task in document processing, scientific research, and business intelligence. This report presents a systematic validation methodology for evaluating vision-language models (VLMs) on chart data extraction. By generating charts from known ground truth data, we can precisely measure extraction accuracy using Mean Absolute Percentage Error (MAPE). Our evaluation of four open-weight models demonstrates that Qwen3-VL 32B and NVIDIA Nemotron 12B VL achieve near-perfect accuracy (99.8-100%) across bar, line, and logarithmic scale charts.
1. Introduction
The Problem
Chart images in PDFs, papers, and reports contain valuable quantitative data that is often inaccessible without manual transcription. While proprietary models like Claude Sonnet can extract this data, they present challenges:
- Cost: ~$0.03 per chart extraction
- Privacy: Data must be sent to external APIs
- Availability: Cannot run in air-gapped environments
Our Approach
We evaluate open-weight vision models that can:
- Run locally via Ollama on consumer hardware
- Reduce cost to $0.0004/chart via OpenRouter (~30Γ cheaper)
- Achieve equivalent or better accuracy than proprietary alternatives
2. Methodology
2.1 Ground Truth Generation
The only valid way to measure extraction accuracy is to compare against known data. We generate test charts programmatically using matplotlib with exact, predetermined values:
# Ground Truth Data (we know exactly what the extraction should return)
GROUND_TRUTH = {
"bar-simple": {
"chartType": "bar",
"title": "Quarterly Sales by Product",
"xAxis": {"categories": ["Q1", "Q2", "Q3", "Q4"]},
"series": [
{"name": "Product A", "data": [120, 145, 132, 168]},
{"name": "Product B", "data": [98, 112, 125, 142]},
{"name": "Product C", "data": [85, 92, 88, 95]},
]
},
"line-multi": {
"chartType": "line",
"title": "Multi-Line Treatment Comparison",
"xAxis": {"categories": [0, 5, 10, 15, 20, 25, 30]},
"series": [
{"name": "Treatment A", "data": [100, 85, 72, 61, 53, 47, 42]},
{"name": "Treatment B", "data": [100, 92, 85, 79, 74, 70, 67]},
{"name": "Control", "data": [100, 78, 58, 42, 30, 22, 17]},
]
},
"line-log": {
"chartType": "line",
"title": "Viscosity vs Concentration (Log Scale)",
"xAxis": {"categories": [0, 5, 10, 15, 20]},
"yAxis": {"scale": "logarithmic"},
"series": [
{"name": "Viscosity", "data": [10000, 1000, 100, 10, 1]},
]
},
}
2.2 Chart Generation Pipeline
Charts are rendered to PNG images at 150 DPI using matplotlib. Each chart is generated from the exact ground truth values above, creating a controlled test environment where we know precisely what the extraction should return.
Test Chart 1: Grouped Bar Chart
- 3 series (Product A, B, C)
- 4 categories (Q1-Q4)
- 12 total data points
Test Chart 2: Multi-Line Chart
- 3 overlapping series (Treatment A, Treatment B, Control)
- 7 data points per series
- 21 total data points
- This is the hardest test case due to overlapping curves
Test Chart 3: Logarithmic Scale
- 1 series (Viscosity)
- 5 data points spanning 4 orders of magnitude (10000 β 1)
- Tests scale detection and log-axis handling
2.3 Models Under Evaluation
| Model | Parameters | VRAM Required | Cloud Cost | Local Capable |
|---|---|---|---|---|
| Qwen3-VL 32B | 32B | 24GB | $0.10/M tokens | β Ollama |
| Qwen3-VL 8B | 8B | 8GB | $0.08/M tokens | β Ollama |
| Nemotron 12B VL | 12B | 16GB | FREE | β Ollama |
| Llama 4 Scout | β | β | $0.10/M tokens | β Ollama |
All models were accessed via OpenRouter API for this evaluation, enabling consistent comparison. Each can also be deployed locally via Ollama for air-gapped environments.
2.4 Accuracy Metric
We use Mean Absolute Percentage Error (MAPE) to measure extraction accuracy:
MAPE = (1/n) Γ Ξ£ |ground_truth[i] - extracted[i]| / ground_truth[i] Γ 100
Accuracy = 100% - MAPE
This metric is:
- Scale-invariant: Works across different value ranges
- Interpretable: 99% accuracy means 1% average error
- Sensitive: Small deviations are detected
3. Results
3.1 Overall Model Performance
| Model | Bar Chart | Multi-Line | Log Scale | Average |
|---|---|---|---|---|
| Qwen3-VL 32B | 99.9% | 99.8% | 100% | 99.9% |
| Qwen3-VL 8B | 99.6% | 99.9% | 100% | 99.8% |
| Nemotron 12B VL | 100% | 100% | 100% | 100% |
| Llama 4 Scout | β Failed | 92.5% | 100%* | 96.2% |
*Llama 4 Scout had one JSON parse failure on bar charts and missed one data point on log scale.
3.2 Detailed Comparison: Bar Chart
Ground Truth: Product A = [120, 145, 132, 168]
| Model | Extracted Values | Accuracy |
|---|---|---|
| Ground Truth | [120, 145, 132, 168] | β |
| Qwen3-VL 32B | [120, 145, 132, 168] | β 100% |
| Qwen3-VL 8B | [120, 145, 130, 170] | 99.3% |
| Nemotron 12B | [120, 145, 132, 168] | β 100% |
| Llama 4 Scout | β JSON parse failed | 0% |
Observation: Qwen3-VL 8B shows minor rounding (132β130, 168β170) but maintains >99% accuracy. This is within acceptable visual read error.
3.3 Detailed Comparison: Multi-Line Chart
This is the hardest test case with three overlapping series requiring the model to distinguish curves and read values at intersections.
Ground Truth: Control = [100, 78, 58, 42, 30, 22, 17]
| Model | Extracted Values | Accuracy |
|---|---|---|
| Ground Truth | [100, 78, 58, 42, 30, 22, 17] | β |
| Qwen3-VL 32B | [100, 78, 58, 42, 30, 22, 17] | β 100% |
| Qwen3-VL 8B | [100, 78, 59, 42, 30, 22, 17] | 99.8% |
| Nemotron 12B | [100, 78, 58, 42, 30, 22, 17] | β 100% |
| Llama 4 Scout | [100, 60, 40, 30, 25, 20, 15] | 82.8% |
Observation: Llama 4 Scout struggles with multi-line charts, producing significant errors (78β60, 58β40). This suggests weaker fine-grained visual reasoning for overlapping elements.
3.4 Inference Speed and Cost
| Model | Avg Time | Cost per Chart | Relative to Claude |
|---|---|---|---|
| Qwen3-VL 32B | 8.8s | $0.0004 | 30Γ cheaper |
| Qwen3-VL 8B | 9.8s | $0.0005 | 60Γ cheaper |
| Nemotron 12B VL | 34.4s | $0.0000 | β cheaper |
| Llama 4 Scout | 8.3s | $0.0004 | 30Γ cheaper |
| Claude Sonnet | ~5s | ~$0.03 | Baseline |
4. Recommendations
For Production Deployments
Primary: Qwen3-VL 32B
- Fastest inference (8.8s average)
- Highest consistent accuracy (99.9%)
- Local deployment:
ollama pull qwen3-vl:32b(requires 24GB VRAM)
For Consumer Hardware
Budget: Qwen3-VL 8B
- Runs on 8GB VRAM (RTX 3070/4060 class)
- Same architecture, slightly lower accuracy
- Local deployment:
ollama pull qwen3-vl:8b
For Zero-Cost Deployments
Free: NVIDIA Nemotron 12B VL
- OpenRouter free tier (no API costs)
- Perfect accuracy in testing (100%)
- Trade-off: Slower inference (34s average)
5. Limitations and Future Work
-
Test Set Size: This validation uses 3 chart types with controlled data. Real-world charts have noise, compression artifacts, and complex layouts.
-
Scatter Plots: Not included in ground truth validation. Visual point cloud extraction is more challenging than line/bar reading.
-
Annotation Density: Charts with many overlapping labels may degrade accuracy.
-
Color Extraction: We validated value extraction but not color accuracy (series identification).
6. Conclusion
Open-weight vision models have reached production-quality accuracy for chart data extraction. The combination of:
- 99-100% accuracy on ground truth validation
- 30Γ cost reduction vs. proprietary APIs
- Local deployment capability for privacy/air-gap requirements
makes Qwen3-VL 32B the recommended choice for chart extraction pipelines. For budget deployments, Qwen3-VL 8B provides excellent accuracy on consumer hardware, and NVIDIA Nemotron 12B VL offers perfect accuracy at zero cost via OpenRouterβs free tier.
Interactive Demo
See the full validation results with side-by-side comparisons: Ground Truth Validation Dashboard
Repository: github.com/menonpg/chartmining-claude