PinchBench: The First Real Benchmarks for OpenClaw Agents — And the Results Are Surprising

By Prahlad Menon · 1 min read

We’ve all been there: staring at a pricing page, wondering if Claude Opus is really worth 10x more than GPT-4o for your agent workflows. Until now, the answer was “vibes and Twitter threads.”

PinchBench changes that. It’s the first benchmarking system built specifically for OpenClaw coding agents — and the results challenge everything you thought you knew about model selection.

What PinchBench Actually Tests

Unlike synthetic benchmarks (MMLU, HumanEval, etc.), PinchBench throws real tasks at agents:

  • Scheduling meetings and parsing calendar events
  • Researching stock prices and market data
  • Writing blog posts and emails
  • Building weather scripts and file structures
  • Processing spreadsheets and PDFs
  • Managing inbox triage
  • Long-term memory retrieval

23 tasks across 8 categories. Each task is graded automatically, by an LLM judge, or both. This is what your agent actually does all day.
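The two grading modes can be combined per task. Here is a minimal sketch of that idea in Python; the `grade` function and judge interface are illustrative assumptions, not the actual PinchBench API:

```python
from typing import Callable, Optional

# Hypothetical grading harness: a task can carry an automatic check,
# an LLM-judge callback, or both, as described in the article.
def grade(output: str,
          auto_check: Optional[Callable[[str], bool]] = None,
          judge: Optional[Callable[[str], bool]] = None) -> bool:
    """A task passes only if every configured grader approves the output."""
    graders = [g for g in (auto_check, judge) if g is not None]
    if not graders:
        return False  # an ungraded task cannot pass
    return all(g(output) for g in graders)

# Example: a weather-script task graded by a simple automatic string check.
print(grade("temp: 21C", auto_check=lambda out: "temp" in out))  # True
```

In a real harness the `judge` callback would wrap an LLM call that returns a pass/fail verdict; the combinator logic above stays the same.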

The Leaderboard (Prepare to Be Surprised)

Rank  Model                          Success Rate  Cost/1M tokens
1     google/gemini-3-flash-preview  95.1%         $0.72
2     minimax/minimax-m2.1           93.6%         $0.14
3     moonshotai/kimi-k2.5           93.4%         $0.20
4     anthropic/claude-sonnet-4.5    92.7%         $3.07
5     google/gemini-3-pro-preview    91.7%         $1.48
6     anthropic/claude-haiku-4.5     90.8%         $0.64
7     anthropic/claude-opus-4.6      90.6%         $5.89
9     openai/gpt-5-nano              85.8%         $0.03
12    openai/gpt-4o                  85.2%         $2.08
31    minimax/minimax-m2.5           35.5%

Full results: pinchbench.com

The Surprises

1. Flash Beats Pro (At Half the Cost)

Gemini 3 Flash Preview (95.1%, $0.72) outperforms Gemini 3 Pro Preview (91.7%, $1.48). The cheaper, faster model wins on agent tasks. This pattern repeats across the board.

2. Minimax M2.1 Is the Value King

At 93.6% success rate for just $0.14 per million tokens, Minimax M2.1 is arguably the best deal in AI. Claude Sonnet 4.5 scores slightly lower (92.7%) and costs 22x more ($3.07).

For most OpenClaw workflows, you could run Minimax and pocket the difference.

3. GPT-5-nano: The $0.03 Contender

OpenAI’s smallest GPT-5 variant hits 85.8% — beating GPT-4o (85.2%) while costing 70x less. If your tasks don’t need top-tier reasoning, this is free money.

4. Bigger ≠ Better (The M2.5 Mystery)

Here’s the weird one: Minimax M2.5 scores only 35.5% — dramatically worse than M2.1 at 93.6%. Newer and bigger isn’t always better for agent workloads. The M2.1 model seems specifically well-tuned for tool use and multi-step reasoning, while M2.5 might be optimized for different tasks entirely.

5. Opus Isn’t Worth 40x the Cost

Claude Opus 4.6 scores 90.6% at $5.89 per million tokens. Minimax M2.1 scores higher (93.6%) at $0.14. That’s a 42x price difference for worse performance on agent tasks.
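The multipliers quoted above fall straight out of the listed per-million-token rates:

```python
# Recomputing the article's price ratios from the leaderboard rates ($/1M tokens).
opus, sonnet, m2_1 = 5.89, 3.07, 0.14

print(f"Opus 4.6 vs M2.1:   {opus / m2_1:.0f}x")   # 42x
print(f"Sonnet 4.5 vs M2.1: {sonnet / m2_1:.0f}x")  # 22x
```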

What This Means for Your Stack

If you’re optimizing for cost:

  • GPT-5-nano ($0.03) for simple tasks
  • Minimax M2.1 ($0.14) for most production workloads
  • Gemini 3 Flash ($0.72) when you need the absolute best

If you’re optimizing for quality:

  • Gemini 3 Flash Preview (95.1%) — yes, Flash, not Pro
  • Minimax M2.1 (93.6%) — second place, first in value
  • Kimi K2.5 (93.4%) — dark horse from Moonshot AI

If you’re currently using Claude Opus or GPT-4o: You might be spending 20-40x more than necessary. Run PinchBench on your own tasks and see.
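One way to act on these tiers is a simple cost-aware router. The model names and rates below come from the leaderboard, but `route_model` itself is a hypothetical sketch, not part of PinchBench or OpenClaw:

```python
# Hypothetical routing table built from the article's cost-tier recommendations.
# Keys are task-complexity labels; values are (model, $/1M tokens).
TIERS = {
    "simple":   ("openai/gpt-5-nano",             0.03),
    "standard": ("minimax/minimax-m2.1",          0.14),
    "critical": ("google/gemini-3-flash-preview", 0.72),
}

def route_model(task_complexity: str) -> str:
    """Pick the recommended model for a given task-complexity tier."""
    model, _cost_per_mtok = TIERS[task_complexity]
    return model

print(route_model("standard"))  # minimax/minimax-m2.1
```

How you classify a task as simple, standard, or critical is workload-specific; the point is that the routing decision can be a lookup once you have benchmark numbers instead of vibes.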

Running It Yourself

git clone https://github.com/pinchbench/skill.git
cd skill
./scripts/run.sh --model minimax/minimax-m2.1

Results auto-upload to the public leaderboard (skip with --no-upload).

The Bigger Picture

PinchBench matters because it tests what agents actually do — not what LLMs know. A model can ace HumanEval and still fail at “schedule a meeting for next Tuesday and email the attendees.”

The benchmarks reveal that tool use and multi-step reasoning are separate skills from raw intelligence. Some models nail both. Some nail neither. And price is almost completely uncorrelated with agent performance.

Stop guessing. Start measuring.

GitHub: pinchbench/skill
Leaderboard: pinchbench.com


Made by the team at kilo.ai. MIT licensed.