pi-autoresearch: Shopify's CEO Let an AI Optimize a 20-Year-Old Codebase — 53% Faster in 93 Commits

By Prahlad Menon

In March, Shopify CEO Tobi Lütke opened a PR against Liquid — the Ruby template engine powering millions of Shopify stores — with 93 commits, all generated by an AI agent running autonomous experiments. The result: 53% faster parse+render and 61% fewer memory allocations on a codebase that’s been hand-tuned by hundreds of contributors over 20 years.

The tool behind it, pi-autoresearch, is now open source.

The Autoresearch Pattern

This builds on Andrej Karpathy’s autoresearch — a deceptively simple idea we covered when it launched. The loop:

  1. Try an optimization idea
  2. Measure against a benchmark
  3. Keep if it improves the metric, revert if it doesn’t
  4. Repeat forever
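
In code, the whole pattern fits in a few lines. Here is a minimal Ruby sketch, assuming a lower-is-better runtime metric; propose_change and the rake test benchmark are hypothetical stand-ins, not pi-autoresearch's actual internals:

  require "benchmark"

  # Hypothetical stand-in: the agent brainstorms an optimization and edits the source.
  def propose_change
  end

  # Wall-clock seconds for the target command (lower is better).
  run_benchmark = -> { Benchmark.realtime { system("rake test", out: File::NULL) } }

  best = run_benchmark.call
  loop do
    propose_change                                        # 1. try an idea
    score = run_benchmark.call                            # 2. measure
    if score < best
      best = score
      system("git", "commit", "-am", "experiment: keep")  # 3. keep if better...
    else
      system("git", "checkout", "--", ".")                # ...revert if not
    end
  end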

Karpathy’s original was hardcoded for ML training (nanoGPT). Lütke and collaborator David Cortés generalized it into a Pi extension that works on any optimization target: test speed, bundle size, build times, Lighthouse scores, whatever you can measure.

What the Agent Actually Found

The Liquid PR is fascinating because it shows what a tireless agent finds that humans miss. Some highlights from 120+ experiments:

  • Replaced StringScanner with byte-level scanning — Single-byte byteindex searching is ~40% faster than regex-based skip_until. This alone cut parse time by ~12% (see the sketch after this list).

  • Eliminated costly StringScanner resets — The tokenizer was calling StringScanner#string= for every {% %} token (878 times per render). Manual byte scanning removed this entirely.

  • Cached small integer to_s — Pre-computed frozen strings for 0–999 avoid 267 Integer#to_s allocations per render. A micro-optimization no human would bother with, but it adds up.

  • Frozen string literals everywhere — The agent systematically found and froze string allocations across the codebase.
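
The first and third items above are easy to make concrete. A rough Ruby sketch of both ideas, illustrative rather than Liquid's actual code (String#byteindex requires Ruby 3.2+):

  require "strscan"

  src = "Hello {% assign x = 1 %} world"

  # Regex-based skip: StringScanner#skip_until pays regexp overhead on every call.
  ss = StringScanner.new(src)
  ss.skip_until(/\{%/)       # advance past the next "{%"

  # Byte-level scan: a plain substring search, no regexp machinery.
  pos = src.byteindex("{%")  # => 6

  # Small-integer cache: precomputed frozen strings for 0..999.
  SMALL_INTS = (0..999).map { |i| i.to_s.freeze }.freeze

  def cached_to_s(n)
    (0..999).cover?(n) ? SMALL_INTS[n] : n.to_s
  end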

Each of these is individually minor. Stacked across 93 surviving commits, they compound to a 53% improvement on a codebase that most developers would consider “already optimized.”

pi-autoresearch: How It Works

The extension adds three tools to Pi:

Tool             Purpose
init_experiment  Configure the session — name, metric, unit, direction
run_experiment   Execute any command, capture output and timing
log_experiment   Record results, auto-commit, update the dashboard

Setup is one line:

pi install https://github.com/davebcn87/pi-autoresearch

Then start a session:

/autoresearch optimize unit test runtime, monitor correctness

The agent writes an autoresearch.md file (the objective and constraints) and an autoresearch.jsonl file (the experiment log). It loops autonomously — brainstorming ideas, trying them, keeping what works. A live dashboard shows progress in real time, with a confidence score that distinguishes real improvements from noise.
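
Each experiment lands in the log as one appendable record. A hypothetical sketch of what a single autoresearch.jsonl entry might contain; the real schema may differ:

  require "json"

  # Hypothetical field names, chosen for illustration only.
  entry = {
    idea:     "freeze string literals in the lexer",
    metric:   "parse+render",
    unit:     "ms",
    baseline: 44.7,
    result:   41.2,
    kept:     true
  }
  File.open("autoresearch.jsonl", "a") { |f| f.puts(entry.to_json) }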

The finalization skill is particularly clever: it takes the messy autoresearch branch and splits it into clean, independent branches — one per logical change — so each can be reviewed and merged separately.

The Autoresearch Ecosystem Is Growing Fast

What started as Karpathy’s single-file experiment has spawned an entire ecosystem. The awesome-autoresearch list tracks real-world applications:

  • GPU kernel optimization — 18 → 187 TFLOPS via autokernel
  • Voice agent prompt engineering — Score 0.728 → 0.969
  • Baseball pitch prediction — R² 0.44 → 0.78 from biomechanics data
  • Ancient scroll ink detection — 4 agents running 24/7 for the Vesuvius Challenge
  • Earth system models — Fire correlation 0.09 → 0.65
  • Bitcoin price modeling — 328 experiments, 50.5% RMSE improvement over power law
  • RL post-training — Eval 0.475 → 0.550 on GSM8K with Qwen 0.5B

The pattern works everywhere there’s a measurable metric and a codebase to tweak.

Why This Matters More Than It Seems

Simon Willison nailed it:

“CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees.”

Three observations:

Robust test suites are the real unlock. Liquid has 974 unit tests. Without them, the agent can’t safely experiment. The autoresearch pattern doesn’t replace testing — it requires it and rewards investment in it.

“Make it faster” becomes actionable when you give an agent a benchmark script. No ambiguity, no subjective judgment — just a number that goes up or down.
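
For Liquid, such a benchmark script can be a few lines. A sketch, where the template, iteration count, and output format are illustrative choices, not the PR's actual harness:

  require "benchmark"
  require "liquid"

  template = Liquid::Template.parse("Hello {{ name }}!")

  # Print one number the agent can parse and push down: milliseconds per render.
  seconds = Benchmark.realtime { 10_000.times { template.render("name" => "world") } }
  puts((seconds / 10_000) * 1000)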

The agent finds what humans won’t bother with. No engineer is going to cache Integer#to_s for 0–999. It’s too tedious, the payoff per change is too small. But an agent that never sleeps will try it, measure a 0.3% improvement, commit it, and move on. Stack enough of those and you get 53%.

We built autoloop to generalize this same pattern — it works with any AI provider (Anthropic, OpenAI, Ollama) and any optimization target. The core insight is the same: the loop is the product, not the model.