What is grokking in machine learning?

A phenomenon where neural networks suddenly generalize long after memorizing training data. The model achieves 100% training accuracy (memorization), plateaus with poor test accuracy, then suddenly jumps to near-100% test accuracy thousands of epochs later — having 'figured out' the underlying algorithm.

How did a 777-parameter transformer learn 10-digit addition?

It couldn't memorize: a lookup table covering every pair of 10-digit operands would need ~10^20 entries. With tied embeddings, no FFN bias, learned positions, and a high learning rate (0.02), it discovered a general procedure for addition that works on numbers it never saw. It learned the algorithm, not a lookup table.

What causes grokking to happen?

Weight decay (L2 regularization) continuously pushes weights toward zero. Eventually this pressure forces the model to find compact representations instead of large lookup tables. The compact representation turns out to be the actual algorithm — models have literally learned Fourier features for modular arithmetic.

What is the parameter cliff in grokking?

A sharp threshold around 800 parameters where models abruptly go from 0% to 100% accuracy. Below the cliff, no training helps. Above it, grokking happens reliably. The transition is sharp, suggesting minimum representational capacity needed to encode the algorithm.

Why does grokking matter?

It shows that neural networks can learn algorithms (not just patterns), suggests overtraining might be underrated, shows that weight decay pushes toward simpler general solutions, and opens interpretability paths: when models grok, we can often reverse-engineer what they learned.

What optimizations enable tiny grokking transformers?

Tied embeddings (input/output share weights), no FFN bias, learned positions (not sinusoidal), high learning rate (0.02). RMSNorm broke generalization, RoPE crashed, and going below d=7 hits a sharp parameter cliff.

Grokking: How a 777-Parameter Transformer Learned Real Math

By Prahlad Menon · Published 2026-02-24

In January 2025, something remarkable happened.

A transformer with only 777 parameters learned to perform 10-digit addition with 99.69% accuracy.

Let that sink in.

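To memorize every possible 10-digit addition problem, a lookup table would need roughly 10^20 entries: each operand has about 10^10 possible values, and 10^10 × 10^10 = 10^20 ordered pairs. This model had 777 parameters. It couldn’t possibly memorize, which means it had to actually learn the algorithm.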

This is grokking.

What Is Grokking?

Grokking is a phenomenon where neural networks suddenly generalize long after they’ve memorized the training data.

Here’s what it looks like:

Phase 1 (Memorization): The model achieves 100% training accuracy by memorizing examples. Test accuracy stays near random (0%).
Phase 2 (Plateau): Nothing visible happens. Training loss is low. Test accuracy is still garbage. Most people stop training here.
Phase 3 (Grokking): Suddenly — often thousands of epochs later — test accuracy jumps from ~0% to ~100%. The model has “figured out” the underlying algorithm.
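
The three phases can be read off logged accuracy curves. The following is a minimal sketch, assuming per-epoch train and test accuracies are available as plain lists. The function names and the 0.99 threshold are illustrative choices, not something taken from the paper, and the curves are synthetic.

```python
from typing import Optional, Sequence, Tuple


def first_epoch_at_least(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first epoch index whose accuracy reaches `threshold`, or None."""
    for epoch, value in enumerate(acc):
        if value >= threshold:
            return epoch
    return None


def grokking_gap(
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    threshold: float = 0.99,
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (memorization epoch, generalization epoch, gap between them)."""
    memorized = first_epoch_at_least(train_acc, threshold)
    generalized = first_epoch_at_least(test_acc, threshold)
    gap = None
    if memorized is not None and generalized is not None:
        gap = generalized - memorized
    return memorized, generalized, gap


# Synthetic curves for illustration only: train accuracy saturates at epoch 3,
# test accuracy stays near zero and only crosses the threshold at epoch 8.
train_curve = [0.2, 0.6, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
test_curve = [0.0, 0.0, 0.01, 0.01, 0.02, 0.01, 0.02, 0.3, 0.99, 1.0]
print(grokking_gap(train_curve, test_curve))  # (3, 8, 5)
```

In a real grokking run the gap would be thousands of epochs rather than five; a large gap with a flat test curve in between is the signature described above.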

The original 2022 paper, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets by Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra at OpenAI, demonstrated this phenomenon on modular arithmetic (like a + b mod 97). If you have not read it, the YouTube walkthrough is the best 20-minute investment you can make in understanding why this matters. The 777-parameter result pushed it further: grokking also works on 10-digit decimal addition, the kind of math we actually use.

The 777-Parameter Architecture

The winning model from yhavinga/gpt-acc-jax:

Layers: 1
Hidden dim: 7
Attention heads: 1
FFN dim: 14 (2× expansion)
Vocab size: 14
Context length: 35

Key optimizations that got it this small:

Tied embeddings: Input and output embeddings share weights
No FFN bias: Removes a few dozen parameters
Learned positions: Sinusoidal positions failed; learned ones are essential
High learning rate (0.02): Tiny models need higher LR to grok
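
To see where the parameters go, here is a rough accounting sketch. The dimensions come from the list above; the rest is assumed: bias-free attention projections and three LayerNorms (pre-attention, pre-FFN, final) with scale and bias. This breakdown happens to sum to 777, but it has not been checked against yhavinga/gpt-acc-jax, and `count_params` is a hypothetical helper.

```python
def count_params(
    d_model: int = 7,
    d_ffn: int = 14,
    vocab: int = 14,
    context: int = 35,
    tied_embeddings: bool = True,
    ffn_bias: bool = False,
) -> int:
    """Count parameters of a 1-layer decoder under the assumptions stated above."""
    token_emb = vocab * d_model            # shared with the output layer when tied
    pos_emb = context * d_model            # learned position embeddings
    attention = 4 * d_model * d_model      # Q, K, V, output projections; no biases (assumed)
    ffn = 2 * d_model * d_ffn              # up and down projections
    if ffn_bias:
        ffn += d_ffn + d_model             # one bias vector per FFN projection
    layer_norms = 3 * 2 * d_model          # three LayerNorms with scale and bias (assumed)
    unembed = 0 if tied_embeddings else vocab * d_model
    return token_emb + pos_emb + attention + ffn + layer_norms + unembed


print(count_params())                        # 777
print(count_params(tied_embeddings=False))   # 875
print(count_params(ffn_bias=True))           # 798
```

Under these assumptions, tying the embeddings saves 98 parameters and dropping the FFN biases saves 21. The learned position table (245 parameters) is the largest single component.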

What didn’t work:

RMSNorm (broke generalization)
No delimiters (needed +/= tokens)
RoPE embeddings (crashed)
Going below d=7 (sharp “parameter cliff”)
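
The delimiter finding is easier to picture with a concrete encoding. The sketch below is hypothetical: the article only states a vocabulary of 14 and a context length of 35, so the two special tokens (`<pad>`, `<eos>`), the zero-padding, and the digit order are all assumptions rather than the repository's actual format.

```python
from typing import List

DIGITS = "0123456789"
# '+' and '=' are the delimiters the article says were needed.
# '<pad>' and '<eos>' are assumed specials that bring the vocabulary to 14.
VOCAB = list(DIGITS) + ["+", "=", "<pad>", "<eos>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
CONTEXT_LENGTH = 35
NUM_DIGITS = 10


def encode_problem(a: int, b: int) -> List[int]:
    """Encode 'a+b=sum<eos>' with zero-padded operands, padded to the context length."""
    limit = 10 ** NUM_DIGITS
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must be non-negative with at most 10 digits")
    # The sum of two 10-digit numbers can have 11 digits, so it gets one extra slot.
    text = f"{a:0{NUM_DIGITS}d}+{b:0{NUM_DIGITS}d}={a + b:0{NUM_DIGITS + 1}d}"
    tokens = list(text) + ["<eos>"]
    tokens += ["<pad>"] * (CONTEXT_LENGTH - len(tokens))
    return [TOKEN_TO_ID[t] for t in tokens]


ids = encode_problem(123, 456)
print(len(VOCAB), len(ids))  # 14 35
```

With this layout the `+` always sits at position 10 and the `=` at position 21, which gives a one-head model fixed landmarks for aligning digits.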

The Parameter Cliff

This is fascinating: there’s a hard threshold around 800 parameters where models abruptly go from 0% to 100% accuracy.

Below the cliff: no amount of training helps. The model simply lacks capacity. Above the cliff: grokking happens reliably.

The transition is sharp — not gradual. It suggests there’s a minimum representational capacity needed to encode the addition algorithm.

Why Does Grokking Happen?

The leading theory involves weight decay and representation compression.

During memorization, the model learns a “lookup table” with large, unstructured weights
Weight decay (L2 regularization) continuously pushes weights toward zero
Eventually, the pressure to shrink weights forces the model to find a more compact representation
The compact representation turns out to be… the actual algorithm
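
The shrinking pressure can be made concrete with an idealized sketch. It uses plain SGD with decoupled weight decay as a stand-in, since the optimizer actually used is not stated here, and it sets the gradient to exactly zero to mimic a model that already fits the training set. The learning rate (0.02) and weight decay (1.0) are the values quoted in this article; the starting weights are arbitrary.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One plain-SGD step with decoupled weight decay: w <- w - lr*g - lr*wd*w."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a flat list of weights."""
    return sum(w * w for w in weights) ** 0.5


# Idealization: with zero gradient, decay is the only force left,
# so the norm shrinks by a factor of (1 - lr * weight_decay) per step.
lr, weight_decay = 0.02, 1.0
weights = [3.0, -4.0]          # arbitrary starting weights with norm 5
zero_grads = [0.0, 0.0]
for _ in range(100):
    weights = sgd_step_with_weight_decay(weights, zero_grads, lr, weight_decay)

print(l2_norm(weights))        # equals 5 * 0.98**100 up to rounding
```

In real training the gradient is not zero: it pushes back to keep the training loss low, so the model can only shed weight norm by finding a cheaper way to fit the same data.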

In Neel Nanda’s mechanistic interpretability work, researchers literally reverse-engineered grokked models and found they’d learned Fourier features for modular arithmetic — the same mathematical structure humans would use.

The network didn’t just learn to generalize. It discovered the correct math.
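
To make the Fourier picture concrete, here is a hand-written sketch of modular addition done with sines and cosines, in the spirit of the mechanism reported in that interpretability work. It is not extracted from a trained network, and the frequency set `FREQS` is an arbitrary assumption.

```python
import math

P = 97                    # prime modulus, as in a + b mod 97
FREQS = (1, 5, 14, 23)    # hypothetical frequencies; any nonzero values below P work


def fourier_mod_add(a: int, b: int, p: int = P, freqs=FREQS) -> int:
    """Compute (a + b) mod p using only sines, cosines, and an argmax."""
    scores = []
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities build cos/sin of w*(a+b)
            # from separate features of a and of b.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # This term is cos(w*(a+b-c)), which peaks at 1 when c = (a+b) mod p.
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        scores.append(score)
    # The candidate c with the highest total score is the answer.
    return max(range(p), key=scores.__getitem__)


print(fourier_mod_add(60, 50))  # 13, since 110 mod 97 = 13
```

The wrap-around of modular arithmetic comes for free because the trigonometric functions are periodic, which is why this representation is so compact compared with a lookup table.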

What We’re Experimenting With

At The Menon Lab, we’re running grokking experiments on modular arithmetic:

# Training config (embedding size matched to the 777-param model)
OPERATION = 'add'  # Modular addition
P = 97  # Prime modulus
EMBED_DIM = 7  # Match 777-param model
NUM_EPOCHS = 100000  # 100k epochs for grokking!
LEARNING_RATE = 2e-2  # Higher LR for tiny models
WEIGHT_DECAY = 1.0  # Crucial for grokking

Our goal: replicate grokking, visualize the weight evolution, and extend it to other arithmetic operations (subtraction, multiplication, division).
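
For readers who want to build the data themselves, here is a minimal sketch of a dataset generator for the four operations. The operation names other than 'add', the 50/50 split, and the function names are assumptions for illustration and may differ from the notebook.

```python
import random
from typing import List, Tuple

Example = Tuple[int, int, int]  # (a, b, result)


def make_dataset(operation: str, p: int) -> List[Example]:
    """Enumerate every (a, b, a op b mod p) triple for a prime modulus p."""
    examples = []
    for a in range(p):
        for b in range(p):
            if operation == "add":
                result = (a + b) % p
            elif operation == "sub":
                result = (a - b) % p
            elif operation == "mul":
                result = (a * b) % p
            elif operation == "div":
                if b == 0:
                    continue  # division by zero is undefined
                # Multiply by the modular inverse of b (three-argument pow, Python 3.8+).
                result = (a * pow(b, -1, p)) % p
            else:
                raise ValueError(f"unknown operation: {operation}")
            examples.append((a, b, result))
    return examples


def train_test_split(examples: List[Example], train_fraction: float = 0.5, seed: int = 0):
    """Shuffle deterministically and split into disjoint train and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(make_dataset("add", 97))
print(len(train), len(test))  # 4704 4705
```

Because the full table for P = 97 has only 9,409 entries, the held-out half is what distinguishes memorization from a learned rule.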

The code lives at menonpg/tiny-transformers.

Why This Matters

Grokking has profound implications:

1. Neural networks can learn algorithms, not just patterns. The 777-parameter model didn’t memorize. It couldn’t. It discovered a general procedure for addition that works on numbers it had never seen.

2. Overtraining might be underrated. We typically stop training when validation loss plateaus. Grokking shows that useful learning can happen long after apparent convergence.

3. Weight decay isn’t just regularization. It’s a force that pushes networks toward simpler, more general solutions — even when memorization works fine on the training set.

4. Interpretability gets concrete. When a model groks, we can often reverse-engineer what it learned. This opens a path to understanding neural networks mechanistically.
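
Point 2 can be illustrated with a toy example: a standard patience-based early-stopping rule applied to a validation curve with a long plateau and a late drop. The curve is synthetic and the function is a generic sketch, not code from any of the cited repositories.

```python
def early_stop_epoch(val_loss, patience):
    """Return the epoch at which patience-based early stopping halts, or None."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None


# Synthetic validation loss: a flat plateau from epoch 1, then a late drop at epoch 50.
val_loss = [2.0, 1.5] + [1.5] * 48 + [0.1] * 10

print(early_stop_epoch(val_loss, patience=5))    # 6: stops long before the drop
print(early_stop_epoch(val_loss, patience=100))  # None: training runs to the end
```

With a typical patience value the run is cut off during the plateau, so the late improvement is never observed.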

The Open Questions

Can we predict when grokking will happen?
Does grokking scale to larger, more complex algorithms?
Can we induce grokking deliberately (not just wait for it)?
What does the representation look like mid-grok?

These are active research areas. And with transformers this small, anyone with a laptop can run experiments.

Try It Yourself

Clone our notebook and run 100k epochs:

git clone https://github.com/menonpg/tiny-transformers
cd tiny-transformers/notebooks
# Open grokking_arithmetic.ipynb in Colab or Jupyter

Watch the training curves. The sudden jump from memorization to generalization is genuinely thrilling to witness.

Grokking reminds us that neural networks are stranger — and more capable — than we give them credit for. A 777-parameter model discovered addition. What else might be waiting to be grokked?

References:

Power, Burda, Edwards, Babuschkin, Misra (2022) - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets · YouTube walkthrough
yhavinga/gpt-acc-jax - 777-parameter transformer
Nanda, Chan, Lieberum, Smith, Steinhardt (2023) - Progress Measures for Grokking via Mechanistic Interpretability
stockeh/mlx-grokking - Grokking in <150 epochs with MLX