Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models

By Prahlad Menon Published 2026-06-01 3 min read

Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models

TL;DR: We validate that open-weight vision models (Qwen3-VL, Nemotron) can extract chart data with 99-100% accuracy, enabling a 30× cost reduction compared to Claude while supporting fully offline/air-gapped deployments.

Abstract

Extracting structured data from chart images is a critical task in document processing, scientific research, and business intelligence. This report presents a systematic validation methodology for evaluating vision-language models (VLMs) on chart data extraction. By generating charts from known ground truth data, we can precisely measure extraction accuracy using Mean Absolute Percentage Error (MAPE). Our evaluation of four open-weight models demonstrates that Qwen3-VL 32B and NVIDIA Nemotron 12B VL achieve near-perfect accuracy (99.8-100%) across bar, line, and logarithmic scale charts.

1. Introduction

The Problem

Chart images in PDFs, papers, and reports contain valuable quantitative data that is often inaccessible without manual transcription. While proprietary models like Claude Sonnet can extract this data, they present challenges:

Cost: ~$0.03 per chart extraction
Privacy: Data must be sent to external APIs
Availability: Cannot run in air-gapped environments

Our Approach

We evaluate open-weight vision models that can:

Run locally via Ollama on consumer hardware
Reduce cost to $0.0004/chart via OpenRouter (~30× cheaper)
Achieve equivalent or better accuracy than proprietary alternatives

2. Methodology

2.1 Ground Truth Generation

The only valid way to measure extraction accuracy is to compare against known data. We generate test charts programmatically using matplotlib with exact, predetermined values:

# Ground Truth Data (we know exactly what the extraction should return)
GROUND_TRUTH = {
    "bar-simple": {
        "chartType": "bar",
        "title": "Quarterly Sales by Product",
        "xAxis": {"categories": ["Q1", "Q2", "Q3", "Q4"]},
        "series": [
            {"name": "Product A", "data": [120, 145, 132, 168]},
            {"name": "Product B", "data": [98, 112, 125, 142]},
            {"name": "Product C", "data": [85, 92, 88, 95]},
        ]
    },
    "line-multi": {
        "chartType": "line",
        "title": "Multi-Line Treatment Comparison",
        "xAxis": {"categories": [0, 5, 10, 15, 20, 25, 30]},
        "series": [
            {"name": "Treatment A", "data": [100, 85, 72, 61, 53, 47, 42]},
            {"name": "Treatment B", "data": [100, 92, 85, 79, 74, 70, 67]},
            {"name": "Control", "data": [100, 78, 58, 42, 30, 22, 17]},
        ]
    },
    "line-log": {
        "chartType": "line",
        "title": "Viscosity vs Concentration (Log Scale)",
        "xAxis": {"categories": [0, 5, 10, 15, 20]},
        "yAxis": {"scale": "logarithmic"},
        "series": [
            {"name": "Viscosity", "data": [10000, 1000, 100, 10, 1]},
        ]
    },
}

2.2 Chart Generation Pipeline

Charts are rendered to PNG images at 150 DPI using matplotlib. Each chart is generated from the exact ground truth values above, creating a controlled test environment where we know precisely what the extraction should return.

Test Chart 1: Grouped Bar Chart

3 series (Product A, B, C)
4 categories (Q1-Q4)
12 total data points

Test Chart 2: Multi-Line Chart

3 overlapping series (Treatment A, Treatment B, Control)
7 data points per series
21 total data points
This is the hardest test case due to overlapping curves

Test Chart 3: Logarithmic Scale

1 series (Viscosity)
5 data points spanning 4 orders of magnitude (10000 → 1)
Tests scale detection and log-axis handling

2.3 Models Under Evaluation

Model	Parameters	VRAM Required	Cloud Cost	Local Capable
Qwen3-VL 32B	32B	24GB	$0.10/M tokens	✅ Ollama
Qwen3-VL 8B	8B	8GB	$0.08/M tokens	✅ Ollama
Nemotron 12B VL	12B	16GB	FREE	✅ Ollama
Llama 4 Scout	—	—	$0.10/M tokens	✅ Ollama

All models were accessed via OpenRouter API for this evaluation, enabling consistent comparison. Each can also be deployed locally via Ollama for air-gapped environments.

2.4 Accuracy Metric

We use Mean Absolute Percentage Error (MAPE) to measure extraction accuracy:

MAPE = (1/n) × Σ |ground_truth[i] - extracted[i]| / ground_truth[i] × 100

Accuracy = 100% - MAPE

This metric is:

Scale-invariant: Works across different value ranges
Interpretable: 99% accuracy means 1% average error
Sensitive: Small deviations are detected

3. Results

3.1 Overall Model Performance

Model	Bar Chart	Multi-Line	Log Scale	Average
Qwen3-VL 32B	99.9%	99.8%	100%	99.9%
Qwen3-VL 8B	99.6%	99.9%	100%	99.8%
Nemotron 12B VL	100%	100%	100%	100%
Llama 4 Scout	❌ Failed	92.5%	100%*	96.2%

*Llama 4 Scout had one JSON parse failure on bar charts and missed one data point on log scale.

3.2 Detailed Comparison: Bar Chart

Ground Truth: Product A = [120, 145, 132, 168]

Model	Extracted Values	Accuracy
Ground Truth	[120, 145, 132, 168]	—
Qwen3-VL 32B	[120, 145, 132, 168]	✅ 100%
Qwen3-VL 8B	[120, 145, 130, 170]	99.3%
Nemotron 12B	[120, 145, 132, 168]	✅ 100%
Llama 4 Scout	❌ JSON parse failed	0%

Observation: Qwen3-VL 8B shows minor rounding (132→130, 168→170) but maintains >99% accuracy. This is within acceptable visual read error.

3.3 Detailed Comparison: Multi-Line Chart

This is the hardest test case with three overlapping series requiring the model to distinguish curves and read values at intersections.

Ground Truth: Control = [100, 78, 58, 42, 30, 22, 17]

Model	Extracted Values	Accuracy
Ground Truth	[100, 78, 58, 42, 30, 22, 17]	—
Qwen3-VL 32B	[100, 78, 58, 42, 30, 22, 17]	✅ 100%
Qwen3-VL 8B	[100, 78, 59, 42, 30, 22, 17]	99.8%
Nemotron 12B	[100, 78, 58, 42, 30, 22, 17]	✅ 100%
Llama 4 Scout	[100, 60, 40, 30, 25, 20, 15]	82.8%

Observation: Llama 4 Scout struggles with multi-line charts, producing significant errors (78→60, 58→40). This suggests weaker fine-grained visual reasoning for overlapping elements.

3.4 Inference Speed and Cost

Model	Avg Time	Cost per Chart	Relative to Claude
Qwen3-VL 32B	8.8s	$0.0004	30× cheaper
Qwen3-VL 8B	9.8s	$0.0005	60× cheaper
Nemotron 12B VL	34.4s	$0.0000	∞ cheaper
Llama 4 Scout	8.3s	$0.0004	30× cheaper
Claude Sonnet	~5s	~$0.03	Baseline

4. Recommendations

For Production Deployments

Primary: Qwen3-VL 32B

Fastest inference (8.8s average)
Highest consistent accuracy (99.9%)
Local deployment: ollama pull qwen3-vl:32b (requires 24GB VRAM)

For Consumer Hardware

Budget: Qwen3-VL 8B

Runs on 8GB VRAM (RTX 3070/4060 class)
Same architecture, slightly lower accuracy
Local deployment: ollama pull qwen3-vl:8b

For Zero-Cost Deployments

Free: NVIDIA Nemotron 12B VL

OpenRouter free tier (no API costs)
Perfect accuracy in testing (100%)
Trade-off: Slower inference (34s average)

5. Limitations and Future Work

Test Set Size: This validation uses 3 chart types with controlled data. Real-world charts have noise, compression artifacts, and complex layouts.
Scatter Plots: Not included in ground truth validation. Visual point cloud extraction is more challenging than line/bar reading.
Annotation Density: Charts with many overlapping labels may degrade accuracy.
Color Extraction: We validated value extraction but not color accuracy (series identification).

6. Conclusion

Open-weight vision models have reached production-quality accuracy for chart data extraction. The combination of:

99-100% accuracy on ground truth validation
30× cost reduction vs. proprietary APIs
Local deployment capability for privacy/air-gap requirements

makes Qwen3-VL 32B the recommended choice for chart extraction pipelines. For budget deployments, Qwen3-VL 8B provides excellent accuracy on consumer hardware, and NVIDIA Nemotron 12B VL offers perfect accuracy at zero cost via OpenRouter’s free tier.

Interactive Demo

See the full validation results with side-by-side comparisons: Ground Truth Validation Dashboard

Repository: github.com/menonpg/chartmining-claude