Superpowers, Chain-of-Thought, and the Problem of Impulsive Code: How Reasoning Became External

By Prahlad Menon · 6 min read

There is a consistent failure mode in every coding agent you have ever used.

You describe what you want. The agent immediately starts writing code. It makes assumptions, skips edge cases, ignores the test suite, and produces something that mostly works if you squint at it and don’t push too hard. You ask it to fix one thing. It breaks two others. An hour later you’re back where you started with a messier codebase.

This failure mode has a name in cognitive science: System 1 thinking. Fast, associative, confident, often wrong. And the last several years of LLM research have been, in large part, a sustained effort to make language models slower.

The Reasoning Revolution: Making Models Think Before They Answer

In 2022, Google researchers published a deceptively simple finding: showing a model worked examples that spell out their intermediate reasoning steps dramatically improved performance on math and logic tasks. This was Chain-of-Thought (CoT) prompting — forcing the model to generate intermediate reasoning before producing a final answer. (The zero-shot variant, simply appending “Let’s think step by step” to the prompt, followed a few months later.)

The intuition was right. Models that rush to an answer make different (worse) mistakes than models that work through a problem. The intermediate steps act as a kind of working memory, catching errors that would be invisible in a direct response.
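The original trick is small enough to show in full. Here is a sketch of the two prompt styles, with the model call itself elided (any completion API would accept the returned string as input):

```python
# Minimal sketch of chain-of-thought prompting: the only difference from a
# direct prompt is the reasoning trigger appended to the answer slot. The
# model call is elided; a chat/completions API would take the string below.

def direct_prompt(question: str) -> str:
    """Ask for the answer immediately (System 1 style)."""
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    """Nudge the model into generating intermediate reasoning steps first."""
    return f"Q: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    q = ("A bat and a ball cost $1.10. The bat costs $1.00 more than "
         "the ball. How much is the ball?")
    print(direct_prompt(q))
    print(cot_prompt(q))
```

The classic bat-and-ball question above is exactly the kind of problem where the direct prompt invites the confident wrong answer ($0.10) and the CoT prompt tends to recover the right one ($0.05).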

CoT prompting evolved into reasoning models: OpenAI o1 and o3, Claude 3.7 extended thinking, DeepSeek R1. These models allocate additional test-time compute to generate internal reasoning traces — sometimes thousands of tokens of private “thinking” — before producing a final answer. The user sees a cleaner response; behind the scenes, the model has been arguing with itself.

The results are striking. On competition math (AIME), programming (Codeforces), and PhD-level science (GPQA), reasoning models outperform their non-reasoning counterparts by wide margins. Slow thinking works.

But reasoning models have a fundamental limitation for software development: the reasoning is hidden.

You can’t correct a thought the model had internally. You can’t review the design decisions made in the thinking space. You can’t say “wait, you assumed we’d use a relational database, but we’re actually using a document store.” By the time you see the output, the architectural decisions have already been made.

Superpowers: Externalized, Correctable Chain-of-Thought

Superpowers, built by Jesse Vincent, takes the same core insight — slow down before you act — and externalizes it as a structured workflow with human checkpoints.

The workflow has seven stages:

1. Brainstorming — Before writing a line of code, the agent asks clarifying questions, explores alternatives, and presents a design in digestible chunks. Not as internal tokens. As a document you can read and correct.

2. Git worktree setup — Creates an isolated workspace so the implementation doesn’t contaminate your main branch. A clean baseline is verified before work starts.

3. Writing plans — A detailed implementation plan with exact file paths, complete code snippets, and verification steps for each task. Tasks are sized at 2–5 minutes each. Clear enough, as the README puts it, for “an enthusiastic junior engineer with poor taste, no judgement, no project context, and an aversion to testing.”

4. Subagent-driven development — Each task is dispatched to a fresh subagent with two-stage review: spec compliance first, then code quality. No accumulated context drift.

5. Test-driven development — Strict RED-GREEN-REFACTOR. Write failing test. Watch it fail. Write minimal code. Watch it pass. Commit. Code written before a test exists gets deleted.

6. Code review — Between tasks, the agent reviews against the plan. Critical issues block progress.

7. Branch finishing — Verifies tests, presents merge/PR/discard options, cleans up.
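Stage 5 is concrete enough to sketch. Here is a toy RED-GREEN cycle in Python; the `slugify` function and its test are illustrative examples, not part of Superpowers:

```python
# A toy RED-GREEN cycle from stage 5. The unit under test (slugify) is a
# hypothetical example; the point is the order of operations: the test must
# be seen to fail before any implementation code is allowed to exist.

# RED: write the test first, against a stub that cannot pass.
def slugify(title: str) -> str:
    raise NotImplementedError  # stub: no implementation yet

def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

try:
    test_slugify()
    raise AssertionError("expected RED: the test passed against the stub")
except NotImplementedError:
    pass  # RED confirmed: the failing test now justifies writing code

# GREEN: write the minimal implementation that makes the test pass.
def slugify(title: str) -> str:
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return "-".join(words)

test_slugify()  # GREEN: passes; commit, then refactor with the test as a net
```

In the actual workflow the agent runs the test suite at each of these checkpoints; the discipline, not the example, is the point.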

This is not a suggestion. Superpowers enforces these stages as mandatory workflows, not optional prompts, and the agent checks for relevant skills before starting any task.

The Key Difference: Correctable vs Hidden Reasoning

Here is the structural comparison:

|  | Standard LLM | Reasoning Model (o3/R1) | Superpowers |
|---|---|---|---|
| Reasoning depth | Low | High | Medium-High |
| Reasoning visible? | No | No (hidden tokens) | Yes (documents) |
| Human can correct? | After the fact | After the fact | Before execution |
| Persists across sessions? | No | No | Via saved docs |
| Test coverage enforced? | No | No | Yes (TDD) |

Reasoning models moved the thinking earlier in the process. Superpowers moves the thinking outside the model entirely — into artifacts that you, a human, can review and correct before the code is written.

This matters because the cost of an error scales with when it’s caught. A wrong assumption in the spec is free to fix. A wrong assumption in the implementation costs hours. A wrong assumption in production costs customers.

Where CoT, Reasoning Models, and Superpowers Intersect

The most interesting setup is using a reasoning model with Superpowers: the model’s internal thinking improves the quality of the externalized spec and plan, and the externalized spec and plan constrain the model’s subsequent implementation.

DeepSeek R1, for example, produces noticeably better brainstorming and implementation plans than its non-reasoning predecessor. The internal thinking produces better external artifacts. The external artifacts then bound the implementation. It compounds.

The analogy that fits: a great architect who sketches before they build, and is willing to throw away the sketch.

The Missing Layer: Memory

Both CoT reasoning and Superpowers address reasoning quality within a single session. Neither addresses what happens when the session ends.

A coding agent with Superpowers knows how to build well. It doesn’t remember:

  • Why you chose this architecture three weeks ago
  • That you tried the other approach and it failed
  • What the conventions of this codebase are
  • Who the stakeholders are and what they care about

This is where soul.py closes the loop. Persistent memory and identity across sessions — the same MEMORY.md that SoulSearch uses in the browser, applied to the coding agent context.

The three layers stack cleanly:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  soul.py / MEMORY.md                    β”‚  ← What do I know? (cross-session)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Superpowers                            β”‚  ← What should I build? (per-session)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Reasoning model (o3, R1, Claude 3.7)   β”‚  ← How should I think? (per-token)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

A coding agent with all three layers: thinks carefully at the token level, externalizes that thinking into correctable artifacts, and remembers what it learned last time.

That’s a qualitatively different thing from the agent that reads your prompt and immediately starts writing code.

Installation

Claude Code (official marketplace):

/plugin install superpowers@claude-plugins-official

Cursor:

/add-plugin superpowers

Codex / OpenCode / Gemini CLI: See the Superpowers README for platform-specific instructions.


Superpowers: github.com/obra/superpowers
soul.py: github.com/menonpg/soul.py
CoT paper (Wei et al. 2022): arxiv.org/abs/2201.11903


Continue to Part 2: code-review-graph — your agent’s other problem is reading the wrong code