pi-autoresearch: Shopify's CEO Let an AI Optimize a 20-Year-Old Codebase — 53% Faster in 93 Commits

By Prahlad Menon

In March, Shopify CEO Tobi Lütke opened a PR against Liquid — the Ruby template engine powering millions of Shopify stores — with 93 commits, all generated by an AI agent running autonomous experiments. The result: 53% faster parse+render and 61% fewer memory allocations on a codebase that’s been hand-tuned by hundreds of contributors over 20 years.

The tool behind it, pi-autoresearch, is now open source.

The Autoresearch Pattern

This builds on Andrej Karpathy’s autoresearch — a deceptively simple idea we covered when it launched. The loop:

  1. Try an optimization idea
  2. Measure against a benchmark
  3. Keep if it improves the metric, revert if it doesn’t
  4. Repeat forever
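
In code, the whole pattern fits in a few lines. Here is a minimal Ruby sketch, assuming a lower-is-better runtime metric; propose_change and the rake test benchmark are hypothetical stand-ins, not pi-autoresearch's actual internals:

  require "benchmark"

  # Hypothetical stand-in: the agent brainstorms an optimization and edits the source.
  def propose_change
  end

  # Wall-clock seconds for the target command (lower is better).
  run_benchmark = -> { Benchmark.realtime { system("rake test", out: File::NULL) } }

  best = run_benchmark.call
  loop do
    propose_change                                        # 1. try an idea
    score = run_benchmark.call                            # 2. measure
    if score < best
      best = score
      system("git", "commit", "-am", "experiment: keep")  # 3. keep if better...
    else
      system("git", "checkout", "--", ".")                # ...revert if not
    end
  end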

Karpathy’s original was hardcoded for ML training (nanoGPT). Lütke and collaborator David Cortés generalized it into a Pi extension that works on any optimization target: test speed, bundle size, build times, Lighthouse scores, whatever you can measure.

What the Agent Actually Found

The Liquid PR is fascinating because it shows what a tireless agent finds that humans miss. Some highlights from 120+ experiments:

  • Replaced StringScanner with byte-level scanning — Single-byte byteindex searching is ~40% faster than regex-based skip_until. This alone cut parse time by ~12% (see the sketch after this list).

  • Eliminated costly StringScanner resets — The tokenizer was calling StringScanner#string= for every {% %} token (878 times per render). Manual byte scanning removed this entirely.

  • Cached small integer to_s — Pre-computed frozen strings for 0–999 avoid 267 Integer#to_s allocations per render. A micro-optimization no human would bother with, but it adds up.

  • Frozen string literals everywhere — The agent systematically found and froze string allocations across the codebase.
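
The first and third items above are easy to make concrete. A rough Ruby sketch of both ideas, illustrative rather than Liquid's actual code (String#byteindex requires Ruby 3.2+):

  require "strscan"

  src = "Hello {% assign x = 1 %} world"

  # Regex-based skip: StringScanner#skip_until pays regexp overhead on every call.
  ss = StringScanner.new(src)
  ss.skip_until(/\{%/)       # advance past the next "{%"

  # Byte-level scan: a plain substring search, no regexp machinery.
  pos = src.byteindex("{%")  # => 6

  # Small-integer cache: precomputed frozen strings for 0..999.
  SMALL_INTS = (0..999).map { |i| i.to_s.freeze }.freeze

  def cached_to_s(n)
    (0..999).cover?(n) ? SMALL_INTS[n] : n.to_s
  end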

Each of these is individually minor. Stacked across 93 surviving commits, they compound to a 53% improvement on a codebase that most developers would consider “already optimized.”

pi-autoresearch: How It Works

The extension adds three tools to Pi:

Tool             Purpose
init_experiment  Configure the session — name, metric, unit, direction
run_experiment   Execute any command, capture output and timing
log_experiment   Record results, auto-commit, update the dashboard

Setup is one line:

pi install https://github.com/davebcn87/pi-autoresearch

Then start a session:

/autoresearch optimize unit test runtime, monitor correctness

The agent writes an autoresearch.md file (the objective and constraints) and an autoresearch.jsonl file (the experiment log). It loops autonomously — brainstorming ideas, trying them, keeping what works. A live dashboard shows progress in real time, with a confidence score that distinguishes real improvements from noise.
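
Each experiment lands in the log as one appendable record. A hypothetical sketch of what a single autoresearch.jsonl entry might contain; the real schema may differ:

  require "json"

  # Hypothetical field names, chosen for illustration only.
  entry = {
    idea:     "freeze string literals in the lexer",
    metric:   "parse+render",
    unit:     "ms",
    baseline: 44.7,
    result:   41.2,
    kept:     true
  }
  File.open("autoresearch.jsonl", "a") { |f| f.puts(entry.to_json) }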

The finalization skill is particularly clever: it takes the messy autoresearch branch and splits it into clean, independent branches — one per logical change — so each can be reviewed and merged separately.

The Autoresearch Ecosystem Is Growing Fast

What started as Karpathy’s single-file experiment has spawned an entire ecosystem. The awesome-autoresearch list tracks real-world applications:

  • GPU kernel optimization — 18 → 187 TFLOPS via autokernel
  • Voice agent prompt engineering — Score 0.728 → 0.969
  • Baseball pitch prediction — R² 0.44 → 0.78 from biomechanics data
  • Ancient scroll ink detection — 4 agents running 24/7 for the Vesuvius Challenge
  • Earth system models — Fire correlation 0.09 → 0.65
  • Bitcoin price modeling — 328 experiments, 50.5% RMSE improvement over power law
  • RL post-training — Eval 0.475 → 0.550 on GSM8K with Qwen 0.5B

The pattern works everywhere there’s a measurable metric and a codebase to tweak.

Why This Matters More Than It Seems

Simon Willison nailed it:

“CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees.”

Three observations:

Robust test suites are the real unlock. Liquid has 974 unit tests. Without them, the agent can’t safely experiment. The autoresearch pattern doesn’t replace testing — it requires it and rewards investment in it.

“Make it faster” becomes actionable when you give an agent a benchmark script. No ambiguity, no subjective judgment — just a number that goes up or down.
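
For Liquid, such a benchmark script can be a few lines. A sketch, where the template, iteration count, and output format are illustrative choices, not the PR's actual harness:

  require "benchmark"
  require "liquid"

  template = Liquid::Template.parse("Hello {{ name }}!")

  # Print one number the agent can parse and push down: milliseconds per render.
  seconds = Benchmark.realtime { 10_000.times { template.render("name" => "world") } }
  puts((seconds / 10_000) * 1000)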

The agent finds what humans won’t bother with. No engineer is going to cache Integer#to_s for 0–999. It’s too tedious, the payoff per change is too small. But an agent that never sleeps will try it, measure a 0.3% improvement, commit it, and move on. Stack enough of those and you get 53%.

We built autoloop to generalize this same pattern — it works with any AI provider (Anthropic, OpenAI, Ollama) and any optimization target. The core insight is the same: the loop is the product, not the model.