PinchBench: The First Real Benchmarks for OpenClaw Agents — And the Results Are Surprising

By Prahlad Menon · 1 min read

We’ve all been there: staring at a pricing page, wondering if Claude Opus is really worth 10x more than GPT-4o for your agent workflows. Until now, the answer was “vibes and Twitter threads.”

PinchBench changes that. It’s the first benchmarking system built specifically for OpenClaw coding agents — and the results challenge everything you thought you knew about model selection.

What PinchBench Actually Tests

Unlike synthetic benchmarks (MMLU, HumanEval, etc.), PinchBench throws real tasks at agents:

  • Scheduling meetings and parsing calendar events
  • Researching stock prices and market data
  • Writing blog posts and emails
  • Building weather scripts and file structures
  • Processing spreadsheets and PDFs
  • Managing inbox triage
  • Long-term memory retrieval

23 tasks across 8 categories. Each task is graded automatically, by an LLM judge, or both. This is what your agent actually does all day.
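The two grading modes can be combined per task. Here is a minimal sketch of that idea in Python; the `grade` function and judge interface are illustrative assumptions, not the actual PinchBench API:

```python
from typing import Callable, Optional

# Hypothetical grading harness: a task can carry an automatic check,
# an LLM-judge callback, or both, as described in the article.
def grade(output: str,
          auto_check: Optional[Callable[[str], bool]] = None,
          judge: Optional[Callable[[str], bool]] = None) -> bool:
    """A task passes only if every configured grader approves the output."""
    graders = [g for g in (auto_check, judge) if g is not None]
    if not graders:
        return False  # an ungraded task cannot pass
    return all(g(output) for g in graders)

# Example: a weather-script task graded by a simple automatic string check.
print(grade("temp: 21C", auto_check=lambda out: "temp" in out))  # True
```

In a real harness the `judge` callback would wrap an LLM call that returns a pass/fail verdict; the combinator logic above stays the same.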

The Leaderboard (Prepare to Be Surprised)

Rank  Model                          Success Rate  Cost/1M tokens
1     google/gemini-3-flash-preview  95.1%         $0.72
2     minimax/minimax-m2.1           93.6%         $0.14
3     moonshotai/kimi-k2.5           93.4%         $0.20
4     anthropic/claude-sonnet-4.5    92.7%         $3.07
5     google/gemini-3-pro-preview    91.7%         $1.48
6     anthropic/claude-haiku-4.5     90.8%         $0.64
7     anthropic/claude-opus-4.6      90.6%         $5.89
9     openai/gpt-5-nano              85.8%         $0.03
12    openai/gpt-4o                  85.2%         $2.08
31    minimax/minimax-m2.5           35.5%

Full results: pinchbench.com

The Surprises

1. Flash Beats Pro (At Half the Cost)

Gemini 3 Flash Preview (95.1%, $0.72) outperforms Gemini 3 Pro Preview (91.7%, $1.48). The cheaper, faster model wins on agent tasks. This pattern repeats across the board.

2. Minimax M2.1 Is the Value King

At 93.6% success rate for just $0.14 per million tokens, Minimax M2.1 is arguably the best deal in AI. Claude Sonnet 4.5 scores slightly lower (92.7%) and costs 22x more ($3.07).

For most OpenClaw workflows, you could run Minimax and pocket the difference.

3. GPT-5-nano: The $0.03 Contender

OpenAI’s smallest GPT-5 variant hits 85.8% — beating GPT-4o (85.2%) while costing 70x less. If your tasks don’t need top-tier reasoning, this is free money.

4. Bigger ≠ Better (The M2.5 Mystery)

Here’s the weird one: Minimax M2.5 scores only 35.5% — dramatically worse than M2.1 at 93.6%. Newer and bigger isn’t always better for agent workloads. The M2.1 model seems specifically well-tuned for tool use and multi-step reasoning, while M2.5 might be optimized for different tasks entirely.

5. Opus Isn’t Worth 40x the Cost

Claude Opus 4.6 scores 90.6% at $5.89 per million tokens. Minimax M2.1 scores higher (93.6%) at $0.14. That’s a 42x price difference for worse performance on agent tasks.
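The multipliers quoted above fall straight out of the listed per-million-token rates:

```python
# Recomputing the article's price ratios from the leaderboard rates ($/1M tokens).
opus, sonnet, m2_1 = 5.89, 3.07, 0.14

print(f"Opus 4.6 vs M2.1:   {opus / m2_1:.0f}x")   # 42x
print(f"Sonnet 4.5 vs M2.1: {sonnet / m2_1:.0f}x")  # 22x
```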

What This Means for Your Stack

If you’re optimizing for cost:

  • GPT-5-nano ($0.03) for simple tasks
  • Minimax M2.1 ($0.14) for most production workloads
  • Gemini 3 Flash ($0.72) when you need the absolute best

If you’re optimizing for quality:

  • Gemini 3 Flash Preview (95.1%) — yes, Flash, not Pro
  • Minimax M2.1 (93.6%) — second place, first in value
  • Kimi K2.5 (93.4%) — dark horse from Moonshot AI

If you’re currently using Claude Opus or GPT-4o: You might be spending 20-40x more than necessary. Run PinchBench on your own tasks and see.
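One way to act on these tiers is a simple cost-aware router. The model names and rates below come from the leaderboard, but `route_model` itself is a hypothetical sketch, not part of PinchBench or OpenClaw:

```python
# Hypothetical routing table built from the article's cost-tier recommendations.
# Keys are task-complexity labels; values are (model, $/1M tokens).
TIERS = {
    "simple":   ("openai/gpt-5-nano",             0.03),
    "standard": ("minimax/minimax-m2.1",          0.14),
    "critical": ("google/gemini-3-flash-preview", 0.72),
}

def route_model(task_complexity: str) -> str:
    """Pick the recommended model for a given task-complexity tier."""
    model, _cost_per_mtok = TIERS[task_complexity]
    return model

print(route_model("standard"))  # minimax/minimax-m2.1
```

How you classify a task as simple, standard, or critical is workload-specific; the point is that the routing decision can be a lookup once you have benchmark numbers instead of vibes.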

Running It Yourself

git clone https://github.com/pinchbench/skill.git
cd skill
./scripts/run.sh --model minimax/minimax-m2.1

Results auto-upload to the public leaderboard (skip with --no-upload).

The Bigger Picture

PinchBench matters because it tests what agents actually do — not what LLMs know. A model can ace HumanEval and still fail at “schedule a meeting for next Tuesday and email the attendees.”

The benchmarks reveal that tool use and multi-step reasoning are separate skills from raw intelligence. Some models nail both. Some nail neither. And price is almost completely uncorrelated with agent performance.

Stop guessing. Start measuring.

GitHub: pinchbench/skill
Leaderboard: pinchbench.com


Made by the team at kilo.ai. MIT licensed.