PinchBench: The First Real Benchmarks for OpenClaw Agents — And the Results Are Surprising
We’ve all been there: staring at a pricing page, wondering if Claude Opus is really worth 10x more than GPT-4o for your agent workflows. Until now, the answer was “vibes and Twitter threads.”
PinchBench changes that. It’s the first benchmarking system built specifically for OpenClaw coding agents — and the results challenge everything you thought you knew about model selection.
What PinchBench Actually Tests
Unlike synthetic benchmarks (MMLU, HumanEval, etc.), PinchBench throws real tasks at agents:
- Scheduling meetings and parsing calendar events
- Researching stock prices and market data
- Writing blog posts and emails
- Building weather scripts and file structures
- Processing spreadsheets and PDFs
- Managing inbox triage
- Long-term memory retrieval
23 tasks across 8 categories, each graded automatically, by an LLM judge, or both. This is what your agent actually does all day.
The Leaderboard (Prepare to Be Surprised)
| Rank | Model | Success Rate | Cost/1M tokens |
|---|---|---|---|
| 1 | google/gemini-3-flash-preview | 95.1% | $0.72 |
| 2 | minimax/minimax-m2.1 | 93.6% | $0.14 |
| 3 | moonshotai/kimi-k2.5 | 93.4% | $0.20 |
| 4 | anthropic/claude-sonnet-4.5 | 92.7% | $3.07 |
| 5 | google/gemini-3-pro-preview | 91.7% | $1.48 |
| 6 | anthropic/claude-haiku-4.5 | 90.8% | $0.64 |
| 7 | anthropic/claude-opus-4.6 | 90.6% | $5.89 |
| 9 | openai/gpt-5-nano | 85.8% | $0.03 |
| 12 | openai/gpt-4o | 85.2% | $2.08 |
| 31 | minimax/minimax-m2.5 | 35.5% | — |
Full results: pinchbench.com
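A rough way to read the table is cost per percentage point of success. Here's a back-of-envelope sketch over the top four rows (the model names are shortened; the numbers come straight from the leaderboard above):

```shell
# Cost per success point = (cost per 1M tokens) / (success rate in %)
printf '%s\n' \
  'gemini-3-flash 95.1 0.72' \
  'minimax-m2.1   93.6 0.14' \
  'kimi-k2.5      93.4 0.20' \
  'claude-sonnet  92.7 3.07' |
awk '{ printf "%-16s $%.4f per success point\n", $1, $3 / $2 }'
```

By this crude metric, Minimax M2.1 comes out roughly 5x cheaper per success point than Gemini 3 Flash, and over 20x cheaper than Claude Sonnet 4.5.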
The Surprises
1. Flash Beats Pro (At Half the Cost)
Gemini 3 Flash Preview (95.1%, $0.72) outperforms Gemini 3 Pro Preview (91.7%, $1.48). The cheaper, faster model wins on agent tasks. This pattern repeats across the board.
2. Minimax M2.1 Is the Value King
At 93.6% success rate for just $0.14 per million tokens, Minimax M2.1 is arguably the best deal in AI. Claude Sonnet 4.5 scores slightly lower (92.7%) and costs 22x more ($3.07).
For most OpenClaw workflows, you could run Minimax and pocket the difference.
3. GPT-5-nano: The $0.03 Contender
OpenAI’s smallest GPT-5 variant hits 85.8% — beating GPT-4o (85.2%) while costing 70x less. If your tasks don’t need top-tier reasoning, this is free money.
4. Bigger ≠ Better (The M2.5 Mystery)
Here’s the weird one: Minimax M2.5 scores only 35.5% — dramatically worse than M2.1 at 93.6%. Newer and bigger isn’t always better for agent workloads. The M2.1 model seems specifically well-tuned for tool use and multi-step reasoning, while M2.5 might be optimized for different tasks entirely.
5. Opus Isn’t Worth 40x the Cost
Claude Opus 4.6 scores 90.6% at $5.89 per million tokens. Minimax M2.1 scores higher (93.6%) at $0.14. That’s a 42x price difference for worse performance on agent tasks.
What This Means for Your Stack
If you’re optimizing for cost:
- GPT-5-nano ($0.03) for simple tasks
- Minimax M2.1 ($0.14) for most production workloads
- Gemini 3 Flash ($0.72) when you need the absolute best
If you’re optimizing for quality:
- Gemini 3 Flash Preview (95.1%) — yes, Flash, not Pro
- Minimax M2.1 (93.6%) — second place, first in value
- Kimi K2.5 (93.4%) — dark horse from Moonshot AI
If you’re currently using Claude Opus or GPT-4o: you might be spending 15-42x more than necessary (GPT-4o runs ~15x Minimax M2.1’s price, Opus ~42x). Run PinchBench on your own tasks and see.
Running It Yourself
```shell
git clone https://github.com/pinchbench/skill.git
cd skill
./scripts/run.sh --model minimax/minimax-m2.1
```

Results auto-upload to the public leaderboard (skip with `--no-upload`).
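To compare several candidates locally before touching the public leaderboard, a simple sweep works. This sketch assumes only the two flags documented above (`--model` and `--no-upload`); the model IDs are taken from the leaderboard:

```shell
# Benchmark a few candidate models without uploading results.
for model in minimax/minimax-m2.1 moonshotai/kimi-k2.5 openai/gpt-5-nano; do
  ./scripts/run.sh --model "$model" --no-upload
done
```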
The Bigger Picture
PinchBench matters because it tests what agents actually do — not what LLMs know. A model can ace HumanEval and still fail at “schedule a meeting for next Tuesday and email the attendees.”
The benchmarks reveal that tool use and multi-step reasoning are separate skills from raw intelligence. Some models nail both. Some nail neither. And price is almost completely uncorrelated with agent performance.
Stop guessing. Start measuring.
GitHub: pinchbench/skill
Leaderboard: pinchbench.com
Made by the team at kilo.ai. MIT licensed.