Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models

By Prahlad Menon 3 min read

Chart Data Extraction: Ground Truth Validation of Open-Weight Vision Models

TL;DR: We validate that open-weight vision models (Qwen3-VL, Nemotron) can extract chart data with 99-100% accuracy, enabling a 30Γ— cost reduction compared to Claude while supporting fully offline/air-gapped deployments.

Abstract

Extracting structured data from chart images is a critical task in document processing, scientific research, and business intelligence. This report presents a systematic validation methodology for evaluating vision-language models (VLMs) on chart data extraction. By generating charts from known ground truth data, we can precisely measure extraction accuracy using Mean Absolute Percentage Error (MAPE). Our evaluation of four open-weight models demonstrates that Qwen3-VL 32B and NVIDIA Nemotron 12B VL achieve near-perfect accuracy (99.8-100%) across bar, line, and logarithmic scale charts.


1. Introduction

The Problem

Chart images in PDFs, papers, and reports contain valuable quantitative data that is often inaccessible without manual transcription. While proprietary models like Claude Sonnet can extract this data, they present challenges:

  • Cost: ~$0.03 per chart extraction
  • Privacy: Data must be sent to external APIs
  • Availability: Cannot run in air-gapped environments

Our Approach

We evaluate open-weight vision models that can:

  1. Run locally via Ollama on consumer hardware
  2. Reduce cost to $0.0004/chart via OpenRouter (~30Γ— cheaper)
  3. Achieve equivalent or better accuracy than proprietary alternatives

2. Methodology

2.1 Ground Truth Generation

The only valid way to measure extraction accuracy is to compare against known data. We generate test charts programmatically using matplotlib with exact, predetermined values:

# Ground Truth Data (we know exactly what the extraction should return)
GROUND_TRUTH = {
    "bar-simple": {
        "chartType": "bar",
        "title": "Quarterly Sales by Product",
        "xAxis": {"categories": ["Q1", "Q2", "Q3", "Q4"]},
        "series": [
            {"name": "Product A", "data": [120, 145, 132, 168]},
            {"name": "Product B", "data": [98, 112, 125, 142]},
            {"name": "Product C", "data": [85, 92, 88, 95]},
        ]
    },
    "line-multi": {
        "chartType": "line",
        "title": "Multi-Line Treatment Comparison",
        "xAxis": {"categories": [0, 5, 10, 15, 20, 25, 30]},
        "series": [
            {"name": "Treatment A", "data": [100, 85, 72, 61, 53, 47, 42]},
            {"name": "Treatment B", "data": [100, 92, 85, 79, 74, 70, 67]},
            {"name": "Control", "data": [100, 78, 58, 42, 30, 22, 17]},
        ]
    },
    "line-log": {
        "chartType": "line",
        "title": "Viscosity vs Concentration (Log Scale)",
        "xAxis": {"categories": [0, 5, 10, 15, 20]},
        "yAxis": {"scale": "logarithmic"},
        "series": [
            {"name": "Viscosity", "data": [10000, 1000, 100, 10, 1]},
        ]
    },
}

2.2 Chart Generation Pipeline

Charts are rendered to PNG images at 150 DPI using matplotlib. Each chart is generated from the exact ground truth values above, creating a controlled test environment where we know precisely what the extraction should return.

Test Chart 1: Grouped Bar Chart

  • 3 series (Product A, B, C)
  • 4 categories (Q1-Q4)
  • 12 total data points

Test Chart 2: Multi-Line Chart

  • 3 overlapping series (Treatment A, Treatment B, Control)
  • 7 data points per series
  • 21 total data points
  • This is the hardest test case due to overlapping curves

Test Chart 3: Logarithmic Scale

  • 1 series (Viscosity)
  • 5 data points spanning 4 orders of magnitude (10000 β†’ 1)
  • Tests scale detection and log-axis handling

2.3 Models Under Evaluation

ModelParametersVRAM RequiredCloud CostLocal Capable
Qwen3-VL 32B32B24GB$0.10/M tokensβœ… Ollama
Qwen3-VL 8B8B8GB$0.08/M tokensβœ… Ollama
Nemotron 12B VL12B16GBFREEβœ… Ollama
Llama 4 Scoutβ€”β€”$0.10/M tokensβœ… Ollama

All models were accessed via OpenRouter API for this evaluation, enabling consistent comparison. Each can also be deployed locally via Ollama for air-gapped environments.

2.4 Accuracy Metric

We use Mean Absolute Percentage Error (MAPE) to measure extraction accuracy:

MAPE = (1/n) Γ— Ξ£ |ground_truth[i] - extracted[i]| / ground_truth[i] Γ— 100

Accuracy = 100% - MAPE

This metric is:

  • Scale-invariant: Works across different value ranges
  • Interpretable: 99% accuracy means 1% average error
  • Sensitive: Small deviations are detected

3. Results

3.1 Overall Model Performance

ModelBar ChartMulti-LineLog ScaleAverage
Qwen3-VL 32B99.9%99.8%100%99.9%
Qwen3-VL 8B99.6%99.9%100%99.8%
Nemotron 12B VL100%100%100%100%
Llama 4 Scout❌ Failed92.5%100%*96.2%

*Llama 4 Scout had one JSON parse failure on bar charts and missed one data point on log scale.

3.2 Detailed Comparison: Bar Chart

Ground Truth: Product A = [120, 145, 132, 168]

ModelExtracted ValuesAccuracy
Ground Truth[120, 145, 132, 168]β€”
Qwen3-VL 32B[120, 145, 132, 168]βœ… 100%
Qwen3-VL 8B[120, 145, 130, 170]99.3%
Nemotron 12B[120, 145, 132, 168]βœ… 100%
Llama 4 Scout❌ JSON parse failed0%

Observation: Qwen3-VL 8B shows minor rounding (132β†’130, 168β†’170) but maintains >99% accuracy. This is within acceptable visual read error.

3.3 Detailed Comparison: Multi-Line Chart

This is the hardest test case with three overlapping series requiring the model to distinguish curves and read values at intersections.

Ground Truth: Control = [100, 78, 58, 42, 30, 22, 17]

ModelExtracted ValuesAccuracy
Ground Truth[100, 78, 58, 42, 30, 22, 17]β€”
Qwen3-VL 32B[100, 78, 58, 42, 30, 22, 17]βœ… 100%
Qwen3-VL 8B[100, 78, 59, 42, 30, 22, 17]99.8%
Nemotron 12B[100, 78, 58, 42, 30, 22, 17]βœ… 100%
Llama 4 Scout[100, 60, 40, 30, 25, 20, 15]82.8%

Observation: Llama 4 Scout struggles with multi-line charts, producing significant errors (78β†’60, 58β†’40). This suggests weaker fine-grained visual reasoning for overlapping elements.

3.4 Inference Speed and Cost

ModelAvg TimeCost per ChartRelative to Claude
Qwen3-VL 32B8.8s$0.000430Γ— cheaper
Qwen3-VL 8B9.8s$0.000560Γ— cheaper
Nemotron 12B VL34.4s$0.0000∞ cheaper
Llama 4 Scout8.3s$0.000430Γ— cheaper
Claude Sonnet~5s~$0.03Baseline

4. Recommendations

For Production Deployments

Primary: Qwen3-VL 32B

  • Fastest inference (8.8s average)
  • Highest consistent accuracy (99.9%)
  • Local deployment: ollama pull qwen3-vl:32b (requires 24GB VRAM)

For Consumer Hardware

Budget: Qwen3-VL 8B

  • Runs on 8GB VRAM (RTX 3070/4060 class)
  • Same architecture, slightly lower accuracy
  • Local deployment: ollama pull qwen3-vl:8b

For Zero-Cost Deployments

Free: NVIDIA Nemotron 12B VL

  • OpenRouter free tier (no API costs)
  • Perfect accuracy in testing (100%)
  • Trade-off: Slower inference (34s average)

5. Limitations and Future Work

  1. Test Set Size: This validation uses 3 chart types with controlled data. Real-world charts have noise, compression artifacts, and complex layouts.

  2. Scatter Plots: Not included in ground truth validation. Visual point cloud extraction is more challenging than line/bar reading.

  3. Annotation Density: Charts with many overlapping labels may degrade accuracy.

  4. Color Extraction: We validated value extraction but not color accuracy (series identification).


6. Conclusion

Open-weight vision models have reached production-quality accuracy for chart data extraction. The combination of:

  • 99-100% accuracy on ground truth validation
  • 30Γ— cost reduction vs. proprietary APIs
  • Local deployment capability for privacy/air-gap requirements

makes Qwen3-VL 32B the recommended choice for chart extraction pipelines. For budget deployments, Qwen3-VL 8B provides excellent accuracy on consumer hardware, and NVIDIA Nemotron 12B VL offers perfect accuracy at zero cost via OpenRouter’s free tier.


Interactive Demo

See the full validation results with side-by-side comparisons: Ground Truth Validation Dashboard


Repository: github.com/menonpg/chartmining-claude