GuppyLM: Build a Working LLM From Scratch in 5 Minutes
There are hundreds of courses, blog posts, and YouTube videos explaining how transformers work. Most of them stop at theory.
GuppyLM is a 9M-parameter language model that you train from scratch in 5 minutes on a free GPU. One notebook. Data generation, tokenizer, model architecture, training loop, inference. When you’re done, you have a working LLM that you built — not downloaded, not fine-tuned, built.
It talks like a fish:
You> what is the meaning of life
Guppy> food. the answer is always food.
You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.
You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.
The fish part is charming. The education part is the point.
Why This Teaches More Than Courses
Most LLM education follows this pattern: “Here’s the attention mechanism math. Here’s a diagram. Now go use the OpenAI API.” You learn the theory but never touch the machinery.
GuppyLM inverts that. You touch every piece:
- Generate synthetic training data — 60K conversations across 60 topics using template composition
- Train a BPE tokenizer — a 4,096-token vocabulary, from raw text to token IDs
- Build the transformer — 6 layers, 384 hidden dim, 6 heads, standard attention + FFN
- Run the training loop — cosine LR schedule, mixed precision, gradient accumulation
- Generate text — temperature sampling, interactive chat
Every concept that exists in GPT-4 exists here in miniature. The difference is you can read every line of code, modify it, break it, and understand why it breaks.
The Architecture (Deliberately Simple)
| Component | Choice | Why |
|---|---|---|
| Parameters | 8.7M | Trains in 5 min on free T4 |
| Layers | 6 | Enough to learn patterns |
| Hidden dim | 384 | Balanced for model size |
| Attention heads | 6 | Standard multi-head |
| FFN | 768 (ReLU) | Simple, interpretable |
| Vocabulary | 4,096 (BPE) | Small enough to visualize |
| Max sequence | 128 tokens | Fish don’t write essays |
| Position encoding | Learned embeddings | Simplest option |
| LM head | Weight-tied with embeddings | Saves parameters |
| Normalization | LayerNorm | Standard |
No GQA. No RoPE. No SwiGLU. No KV cache. No early exit. Every modern transformer trick is deliberately excluded so you can see the core mechanism clearly before the optimizations.
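The 8.7M figure in the table can be sanity-checked with a back-of-the-envelope count. The sketch below assumes biased linear projections, two LayerNorms per block, and a final LayerNorm; the notebook's exact bookkeeping may differ slightly:

```python
# Rough parameter count for the GuppyLM configuration above.
# Assumes biased linear layers; a tied LM head adds no parameters.
VOCAB, D_MODEL, N_LAYERS, D_FFN, MAX_SEQ = 4096, 384, 6, 768, 128

tok_emb = VOCAB * D_MODEL                    # token embeddings (tied with LM head)
pos_emb = MAX_SEQ * D_MODEL                  # learned position embeddings

attn = 4 * (D_MODEL * D_MODEL + D_MODEL)     # Wq, Wk, Wv, Wo + biases
ffn = (D_MODEL * D_FFN + D_FFN) + (D_FFN * D_MODEL + D_MODEL)
norms = 2 * 2 * D_MODEL                      # two LayerNorms (gain + bias each)
per_layer = attn + ffn + norms

total = tok_emb + pos_emb + N_LAYERS * per_layer + 2 * D_MODEL  # + final LayerNorm
print(f"{total / 1e6:.2f}M parameters")      # → 8.73M parameters
```

Weight tying is visible in the arithmetic: an untied LM head would add another 4,096 × 384 ≈ 1.6M parameters, nearly a fifth of the model.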
What Each Piece Teaches You
Data Generation: Personality Is Data, Not Prompts
GuppyLM doesn’t use system prompts. The fish personality is baked into the weights through training data. This is a fundamental insight into how LLMs work: at this scale, instruction-following doesn’t emerge, so the model learns its personality from the statistical patterns in its training data.
The data generator uses template composition with randomized components:
- 60 topic categories (food, bubbles, temperature, existential dread)
- 30 tank objects, 17 food types, 25 activities
- ~16K unique outputs from ~60 templates
Lesson: This is exactly how large-scale synthetic data works (Alpaca, WizardLM, Orca). GuppyLM just does it transparently enough to see the mechanism.
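A minimal sketch of template composition, with made-up templates and component lists standing in for the repo’s actual 60 topics and ~60 templates:

```python
import random

# Illustrative stand-ins: the real generator draws from 60 topic categories,
# 30 tank objects, 17 food types, 25 activities, and ~60 templates.
FOODS = ["flakes", "bloodworms", "brine shrimp"]
OBJECTS = ["the castle", "the filter", "the big plant"]
TEMPLATES = [
    ("what is your favorite food", "{food}. always {food}."),
    ("what do you do all day", "i swim past {obj} and think about {food}."),
]

def make_example(rng):
    """Compose one (user, guppy) pair by filling a random template."""
    prompt, answer = rng.choice(TEMPLATES)
    filled = answer.format(food=rng.choice(FOODS), obj=rng.choice(OBJECTS))
    return {"user": prompt, "guppy": filled}

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(5)]
for ex in dataset:
    print(ex["user"], "->", ex["guppy"])
```

The combinatorics do the work: a handful of templates crossed with a few component lists multiplies into thousands of distinct but stylistically consistent examples.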
Tokenization: From Text to Numbers
The BPE tokenizer training is visible and inspectable. You can see:
- How byte-pair encoding merges frequent character pairs
- Why vocabulary size matters (4,096 is tiny — GPT-4 uses 100K+)
- How the same word can tokenize differently depending on surrounding whitespace and capitalization
- Why unknown words get split into subword pieces
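The merge loop at the heart of BPE fits in a few lines. This toy version (independent of the `tokenizers` library the project uses) repeatedly fuses the most frequent adjacent symbol pair:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies from a tiny fish corpus, split into characters.
words = {tuple("bubbles"): 5, tuple("bubble"): 3, tuple("blub"): 4}
for _ in range(3):  # three merge steps; a real tokenizer runs thousands
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```

Running merges until the vocabulary hits 4,096 is exactly the training step; unknown words then fall back to whatever subword pieces the learned merges produce.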
The Transformer: Attention in Action
At 6 layers and 384 dimensions, you can actually trace what the model does:
- How query/key/value matrices create attention patterns
- Why multi-head attention captures different relationships
- How the FFN layers act as learned key-value memories
- How LayerNorm stabilizes gradients during training
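A single attention head can be written out in a handful of NumPy lines. The dimensions below match the table (384 model dim, 64 dims per head across 6 heads), but the weights are random stand-ins, not trained ones:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """One head of causal scaled dot-product attention."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq, seq) pairwise similarities
    causal = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[causal] = -np.inf                  # each token sees only the past
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_head = 8, 384, 64             # 384 / 6 heads = 64 dims per head
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (0.02 * rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention(x, Wq, Wk, Wv)
print(out.shape)                              # → (8, 64)
```

Printing `weights` for a trained head is how you trace what the model attends to; the first row is always `[1, 0, …]` because the first token has no past to look at.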
Training Loop: The Learning Process
The training notebook shows:
- Cosine LR scheduling — why learning rate warmup and decay matter
- Mixed precision (AMP) — how float16 speeds up training without losing quality
- Loss curves — watching the model go from random noise to coherent fish speech
- Overfitting signals — what happens when your model memorizes instead of generalizing
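The warmup-plus-cosine shape is easy to write down directly. The hyperparameter values below are illustrative, not the notebook’s:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=2000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.
    All hyperparameter values here are illustrative defaults."""
    if step < warmup:
        return max_lr * (step + 1) / warmup              # ramp up from near zero
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # eases from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at(step):.2e}")
```

Warmup keeps the first noisy gradient steps from blowing up the randomly initialized weights; the slow cosine tail lets the model settle into a minimum instead of bouncing around it.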
Inference: How Generation Works
The chat interface demonstrates:
- Autoregressive generation — producing one token at a time, feeding output back as input
- Temperature sampling — how randomness controls creativity vs. coherence
- Context window limits — why the model degrades after 128 tokens (and why GPT-4’s 128K context is expensive)
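Temperature sampling plus the autoregressive feedback loop, in plain Python; `next_logits` here is a toy stand-in for a forward pass through the model:

```python
import math, random

def sample(logits, temperature=1.0, rng=random):
    """Sample one token id from temperature-scaled logits."""
    scaled = [l / temperature for l in logits]   # low temp sharpens, high temp flattens
    m = max(scaled)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax over the vocabulary
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def next_logits(tokens):
    """Toy stand-in for a model forward pass over a 10-token vocabulary."""
    return [float((t * 7 + len(tokens)) % 5) for t in range(10)]

rng = random.Random(0)
tokens = [1]                                     # start token
for _ in range(5):                               # feed each output back as input
    tokens.append(sample(next_logits(tokens), temperature=0.8, rng=rng))
print(tokens)
```

As temperature approaches zero, sampling collapses to greedy argmax decoding; raise it and low-probability tokens start winning, which is where the creativity (and the incoherence) comes from.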
The “Swap the Data” Experiment
This is the most valuable thing you can do with GuppyLM: change what it learns.
The fish personality comes entirely from the training data templates. Replace them with:
- A pirate character
- A Shakespearean poet
- A customer service agent
- A medical triage bot (simplified)
- Your own persona
The model architecture stays identical. The tokenizer stays identical. The training loop stays identical. Only the data changes — and the model becomes a completely different entity.
This teaches a truth about LLMs that’s hard to internalize from theory: the model is the data. The transformer is just the learning machinery. What it becomes depends entirely on what you feed it.
Run It Right Now
In your browser (no install): arman-bd.github.io/guppylm — runs a quantized ONNX model (~10MB) via WebAssembly.
Chat in Colab (pre-trained model): use_guppylm.ipynb
Train from scratch (the real learning): train_guppylm.ipynb — set runtime to T4 GPU, run all cells. 5 minutes.
Local chat:

```shell
pip install torch tokenizers
python -m guppylm chat
```
Design Decisions Worth Studying
The README includes a “Design Decisions” section that’s worth reading carefully:
Why no system prompt? A 9M model can’t conditionally follow instructions. The personality must be baked in. This explains why fine-tuning exists — and why it works differently from prompting.
Why single-turn only? Multi-turn chat degrades by turn 3 or 4 with a 128-token context window. This illustrates why context length matters and why companies spend billions extending it.
Why vanilla transformer? Modern optimizations (GQA, SwiGLU, RoPE) don’t help at 9M parameters. They exist to scale efficiently — learning what they solve requires seeing the system without them first.
Why synthetic data? Consistency. A character needs consistent personality across examples. This is the same principle behind RLHF and constitutional AI — shaping behavior through carefully curated data.
The Numbers
- 9M parameters (vs. GPT-4’s ~1.8T)
- 5 minutes training time on free T4
- 60K synthetic training conversations
- 10MB quantized ONNX for browser inference
- MIT license
- GitHub | HuggingFace model | HuggingFace dataset
Related Reading
- effGen: Autonomous Agents From Small Language Models — What small models can actually do in production
- Unsloth: Fast LLM Fine-Tuning — When you want to fine-tune a real model instead of training from scratch
- LLM Architecture Gallery by Raschka — Visual guide to how modern architectures differ