pi-autoresearch: Shopify's CEO Let an AI Optimize a 20-Year-Old Codebase — 53% Faster in 93 Commits
In March, Shopify CEO Tobi Lütke opened a PR against Liquid — the Ruby template engine powering millions of Shopify stores — with 93 commits, all generated by an AI agent running autonomous experiments. The result: 53% faster parse+render and 61% fewer memory allocations on a codebase that’s been hand-tuned by hundreds of contributors over 20 years.
The tool behind it, pi-autoresearch, is now open source.
The Autoresearch Pattern
This builds on Andrej Karpathy’s autoresearch — a deceptively simple idea we covered when it launched. The loop:
- Try an optimization idea
- Measure against a benchmark
- Keep if it improves the metric, revert if it doesn’t
- Repeat forever
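Liquid is Ruby, so in Ruby terms the loop is roughly the sketch below. Here `run_benchmark` and `apply_next_idea` are hypothetical stand-ins for illustration, not pi-autoresearch's actual internals:

```ruby
require "open3"

# Hypothetical: assumes a benchmark script that prints a single number,
# e.g. milliseconds per run.
def run_benchmark
  out, _status = Open3.capture2("ruby", "bench/run.rb")
  Float(out.strip)
end

# Hypothetical stub: in reality the agent generates and applies a patch.
def apply_next_idea
end

best = run_benchmark

loop do
  apply_next_idea
  score = run_benchmark
  if score < best # lower is better for a runtime metric
    best = score
    system("git", "commit", "-am", "experiment: kept at #{score}")
  else
    system("git", "checkout", "--", ".") # revert the failed attempt
  end
end
```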
Karpathy’s original was hardcoded to ML training (nanoGPT). What Lütke and collaborator David Cortés did was generalize it into a Pi extension that works on any optimization target — test speed, bundle size, build times, Lighthouse scores, whatever you can measure.
What the Agent Actually Found
The Liquid PR is fascinating because it shows what a tireless agent finds that humans miss. Some highlights from 120+ experiments:
- Replaced `StringScanner` with byte-level scanning — Single-byte `byteindex` searching is ~40% faster than regex-based `skip_until`. This alone cut parse time by ~12%.
- Eliminated costly `StringScanner` resets — The tokenizer was calling `StringScanner#string=` for every `{% %}` token (878 times per render). Manual byte scanning removed this entirely.
- Cached small-integer `to_s` — Pre-computed frozen strings for 0–999 avoid 267 `Integer#to_s` allocations per render. A micro-optimization no human would bother with, but it adds up (see the sketch after this list).
Frozen string literals everywhere — The agent systematically found and froze string allocations across the codebase.
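The `to_s` cache is easy to picture. A minimal sketch of the technique, assuming a simple lookup table (the actual Liquid patch may be shaped differently):

```ruby
# Pre-computed, frozen strings for the hot 0..999 range.
INT_TO_S = (0..999).map { |i| i.to_s.freeze }.freeze

# Returns a cached string for small integers, skipping the fresh
# Integer#to_s allocation; falls back to to_s for everything else.
def int_to_s(n)
  n.between?(0, 999) ? INT_TO_S[n] : n.to_s
end
```

Each call saves one short-lived string allocation; at 267 calls per render, that is exactly the kind of win only a tireless agent bothers to bank.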
Each of these is individually minor. Stacked across 93 surviving commits, they compound to a 53% improvement on a codebase that most developers would consider “already optimized.”
pi-autoresearch: How It Works
The extension adds three tools to Pi:
| Tool | Purpose |
|---|---|
| `init_experiment` | Configure the session — name, metric, unit, direction |
| `run_experiment` | Execute any command, capture output and timing |
| `log_experiment` | Record results, auto-commit, update the dashboard |
Setup is one line:
```
pi install https://github.com/davebcn87/pi-autoresearch
```
Then start a session:
```
/autoresearch optimize unit test runtime, monitor correctness
```
The agent writes an `autoresearch.md` file (the objective and constraints) and an `autoresearch.jsonl` file (the experiment log). It loops autonomously — brainstorming ideas, trying them, keeping what works. A live dashboard shows progress in real time, with a confidence score that distinguishes real improvements from noise.
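One experiment per line keeps the log trivially appendable and diffable. A hypothetical entry, with field names assumed from the `init_experiment` parameters above rather than taken from the real schema:

```ruby
require "json"

# Hypothetical log entry; the name/metric/unit/direction fields mirror
# init_experiment's parameters, plus the measured result and verdict.
entry = {
  name: "byte-level tokenizer scan",
  metric: "parse+render time",
  unit: "ms",
  direction: "minimize",
  value: 41.2,
  kept: true
}
File.open("autoresearch.jsonl", "a") { |f| f.puts(entry.to_json) }
```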
The finalization skill is particularly clever: it takes the messy autoresearch branch and splits it into clean, independent branches — one per logical change — so each can be reviewed and merged separately.
The Autoresearch Ecosystem Is Growing Fast
What started as Karpathy’s single-file experiment has spawned an entire ecosystem. The awesome-autoresearch list tracks real-world applications:
- GPU kernel optimization — 18 → 187 TFLOPS via autokernel
- Voice agent prompt engineering — Score 0.728 → 0.969
- Baseball pitch prediction — R² 0.44 → 0.78 from biomechanics data
- Ancient scroll ink detection — 4 agents running 24/7 for the Vesuvius Challenge
- Earth system models — Fire correlation 0.09 → 0.65
- Bitcoin price modeling — 328 experiments, 50.5% RMSE improvement over power law
- RL post-training — Eval 0.475 → 0.550 on GSM8K with Qwen 0.5B
The pattern works everywhere there’s a measurable metric and a codebase to tweak.
Why This Matters More Than It Seems
Simon Willison nailed it:
“CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees.”
Three observations:
Robust test suites are the real unlock. Liquid has 974 unit tests. Without them, the agent can’t safely experiment. The autoresearch pattern doesn’t replace testing — it requires it and rewards investment in it.
“Make it faster” becomes actionable when you give an agent a benchmark script. No ambiguity, no subjective judgment — just a number that goes up or down.
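For Liquid, such a script could be as small as this sketch (the fixture path and iteration count are placeholders, not the PR's actual harness):

```ruby
require "liquid"
require "benchmark"

# Prints a single number, ms per parse+render, for the agent to minimize.
TEMPLATE = File.read("templates/product.liquid") # hypothetical fixture
RUNS = 1_000

total = Benchmark.realtime do
  RUNS.times { Liquid::Template.parse(TEMPLATE).render }
end
puts format("%.3f", total * 1000 / RUNS)
```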
The agent finds what humans won't bother with. No engineer is going to cache `Integer#to_s` for 0–999. It's too tedious, and the payoff per change is too small. But an agent that never sleeps will try it, measure a 0.3% improvement, commit it, and move on. Stack enough of those and you get 53%.
We built autoloop to generalize this same pattern — it works with any AI provider (Anthropic, OpenAI, Ollama) and any optimization target. The core insight is the same: the loop is the product, not the model.
Links
- pi-autoresearch — The Pi extension Tobi used
- Shopify Liquid PR — 93 commits, full experiment trace
- awesome-autoresearch — Curated use cases with traces
- autoresearch (Karpathy) — The original
- autoloop — Our generalization for any domain
- Simon Willison’s analysis