GuppyLM: Build a Working LLM From Scratch in 5 Minutes

By Prahlad Menon · 6 min read

There are hundreds of courses, blog posts, and YouTube videos explaining how transformers work. Most of them stop at theory.

GuppyLM is a 9M parameter language model that you train from scratch in 5 minutes on a free GPU. One notebook. Data generation, tokenizer, model architecture, training loop, inference. When you’re done, you have a working LLM that you built — not downloaded, not fine-tuned, built.

It talks like a fish:

You> what is the meaning of life
Guppy> food. the answer is always food.

You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.

You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.

The fish part is charming. The education part is the point.

Why This Teaches More Than Courses

Most LLM education follows this pattern: “Here’s the attention mechanism math. Here’s a diagram. Now go use the OpenAI API.” You learn the theory but never touch the machinery.

GuppyLM inverts that. You touch every piece:

  1. Generate synthetic training data — 60K conversations across 60 topics using template composition
  2. Train a BPE tokenizer — 4,096 vocabulary, from raw text to token IDs
  3. Build the transformer — 6 layers, 384 hidden dim, 6 heads, standard attention + FFN
  4. Run the training loop — cosine LR schedule, mixed precision, gradient accumulation
  5. Generate text — temperature sampling, interactive chat

Every concept that exists in GPT-4 exists here in miniature. The difference is you can read every line of code, modify it, break it, and understand why it breaks.

The Architecture (Deliberately Simple)

| Component | Choice | Why |
| --- | --- | --- |
| Parameters | 8.7M | Trains in 5 min on free T4 |
| Layers | 6 | Enough to learn patterns |
| Hidden dim | 384 | Balanced for model size |
| Attention heads | 6 | Standard multi-head |
| FFN | 768 (ReLU) | Simple, interpretable |
| Vocabulary | 4,096 (BPE) | Small enough to visualize |
| Max sequence | 128 tokens | Fish don’t write essays |
| Position encoding | Learned embeddings | Simplest option |
| LM head | Weight-tied with embeddings | Saves parameters |
| Normalization | LayerNorm | Standard |

No GQA. No RoPE. No SwiGLU. No KV cache. No early exit. Every modern transformer trick is deliberately excluded so you can see the core mechanism clearly before the optimizations.
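The 8.7M figure in the table can be checked by hand. A back-of-envelope count, assuming biases on every linear layer and the weight-tied LM head from the table (the notebook's exact layout may differ slightly):

```python
# Back-of-envelope parameter count for the architecture table above.
# Assumes biases on all linear layers and tied input/output embeddings.
vocab, d, layers, ffn, seq = 4096, 384, 6, 768, 128

tok_emb = vocab * d                      # 1,572,864 (shared with LM head)
pos_emb = seq * d                        # 49,152 learned positions

attn = 4 * (d * d + d)                   # Q, K, V, and output projections
ffn_params = (d * ffn + ffn) + (ffn * d + d)
norms = 2 * 2 * d                        # two LayerNorms (weight + bias each)
per_layer = attn + ffn_params + norms    # 1,183,872

total = tok_emb + pos_emb + layers * per_layer + 2 * d  # + final LayerNorm
print(f"{total:,}")                      # 8,726,016 ≈ 8.7M
```

Note where the budget goes: the tied embedding table alone is ~1.6M parameters, which is why weight tying is worth a row in the table.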

What Each Piece Teaches You

Data Generation: Personality Is Data, Not Prompts

GuppyLM doesn’t use system prompts. The fish personality is baked into the weights through training data. This reflects a fundamental fact about LLMs: at 9M parameters, instruction-following doesn’t emerge, so the model can only learn its personality from the statistical patterns in its training data.

The data generator uses template composition with randomized components:

  • 60 topic categories (food, bubbles, temperature, existential dread)
  • 30 tank objects, 17 food types, 25 activities
  • ~16K unique outputs from ~60 templates
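The composition trick above can be sketched in a few lines. A minimal generator, with topic fragments and templates that are illustrative rather than the repo's actual data:

```python
import random

# Illustrative fragments -- the real generator composes ~60 templates
# across 60 topic categories with 30 objects, 17 foods, 25 activities.
FOODS = ["flakes", "bloodworms", "algae wafers"]
OBJECTS = ["castle", "filter", "plastic plant"]
TEMPLATES = [
    "i saw the {obj} today. still not food. disappointing.",
    "when do we eat {food}? asking for me.",
    "the {obj} is my favorite. after {food}.",
]

def generate_example(rng: random.Random) -> str:
    """Compose one training reply by filling a random template."""
    template = rng.choice(TEMPLATES)
    return template.format(food=rng.choice(FOODS), obj=rng.choice(OBJECTS))

rng = random.Random(42)
print(generate_example(rng))
```

Even this toy version yields up to 3 × 3 × 3 = 27 distinct outputs from three templates; the same multiplicative effect is how ~60 templates produce ~16K unique examples.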

Lesson: This is exactly how large-scale synthetic data works (Alpaca, WizardLM, Orca). GuppyLM just does it transparently enough to see the mechanism.

Tokenization: From Text to Numbers

The BPE tokenizer training is visible and inspectable. You can see:

  • How byte-pair encoding merges frequent character pairs
  • Why vocabulary size matters (4,096 is tiny — GPT-4 uses 100K+)
  • How the same word can tokenize differently depending on surrounding characters (leading spaces, punctuation)
  • Why unknown words get split into subword pieces
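The first bullet, merging frequent pairs, is the whole trick. A toy sketch of one BPE training step in pure Python (the notebook itself uses the `tokenizers` library):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {word-as-character-tuple: frequency}
words = {("f","i","s","h"): 5, ("f","i","n"): 3, ("d","i","s","h"): 2}
pair = most_frequent_pair(words)   # ('f', 'i'): appears 8 times
words = merge_pair(words, pair)
print(pair, words)                 # 'fi' is now a single vocabulary symbol
```

Run this loop 4,096 times on real text and you have the tokenizer: each merge adds one entry to the vocabulary, which is why rare words end up split into subword pieces.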

The Transformer: Attention in Action

At 6 layers and 384 dimensions, you can actually trace what the model does:

  • How query/key/value matrices create attention patterns
  • Why multi-head attention captures different relationships
  • How the FFN layers act as learned key-value memories
  • How LayerNorm stabilizes gradients during training
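All four of those bullets fit into one readable block at this scale. A minimal sketch at the table's dimensions (384 dim, 6 heads, 768 ReLU FFN); the pre-LN layout and class names here are my assumptions, not necessarily the notebook's:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LN transformer layer: causal multi-head attention + ReLU FFN."""
    def __init__(self, d=384, heads=6, ffn=768):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn), nn.ReLU(), nn.Linear(ffn, d))

    def forward(self, x):
        # Causal mask: True blocks attention, so each position sees only the past.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                  # residual connection
        x = x + self.ffn(self.ln2(x))     # residual connection
        return x

x = torch.randn(1, 16, 384)               # (batch, seq, dim)
print(Block()(x).shape)                   # torch.Size([1, 16, 384])
```

Stack six of these, add token and position embeddings in front and the tied LM head behind, and that is the whole model.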

Training Loop: The Learning Process

The training notebook shows:

  • Cosine LR scheduling — why learning rate warmup and decay matter
  • Mixed precision (AMP) — how float16 speeds up training without losing quality
  • Loss curves — watching the model go from random noise to coherent fish speech
  • Overfitting signals — what happens when your model memorizes instead of generalizing
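The first bullet, warmup plus cosine decay, is only a few lines of math. A generic sketch (the notebook's actual peak LR, warmup length, and floor may differ):

```python
import math

def cosine_lr(step, total_steps, peak=3e-4, warmup=100, floor=3e-5):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup           # ramp up from ~0
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total))     # tiny: still warming up
print(cosine_lr(99, total))    # exactly `peak` at the end of warmup
print(cosine_lr(999, total))   # near the floor
```

Warmup keeps the first noisy gradient steps from wrecking the randomly initialized weights; the cosine decay lets the model settle into a minimum instead of bouncing around it.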

Inference: How Generation Works

The chat interface demonstrates:

  • Autoregressive generation — producing one token at a time, feeding output back as input
  • Temperature sampling — how randomness controls creativity vs. coherence
  • Context window limits — why the model degrades after 128 tokens (and why GPT-4’s 128K context is expensive)
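The first two bullets combine into one tiny loop. A pure-Python sketch of temperature sampling inside an autoregressive loop, where the "model" is a stand-in returning fixed fake logits:

```python
import math, random

def sample(logits, temperature=1.0, rng=random):
    """Scale logits by 1/T, softmax, then draw one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs)[0]

# Stand-in for the model: a real LM would recompute logits from `tokens`.
fake_logits = [2.0, 1.0, 0.1]

# Autoregressive loop: sample a token, append it, repeat.
tokens = [0]                                     # illustrative start token
for _ in range(5):
    tokens.append(sample(fake_logits, temperature=0.8))
print(tokens)
```

Low temperature sharpens the softmax toward the argmax (coherent but repetitive); high temperature flattens it toward uniform (creative but incoherent). That single divide-by-T is the whole knob.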

The “Swap the Data” Experiment

This is the most valuable thing you can do with GuppyLM: change what it learns.

The fish personality comes entirely from the training data templates. Replace them with:

  • A pirate character
  • A Shakespearean poet
  • A customer service agent
  • A medical triage bot (simplified)
  • Your own persona

The model architecture stays identical. The tokenizer stays identical. The training loop stays identical. Only the data changes — and the model becomes a completely different entity.
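Concretely, the swap is one variable. Assuming the generator's templates live in a list (the names here are illustrative, not the repo's), the whole experiment looks like:

```python
import random

# Same generator machinery; only the data changes.
FISH_TEMPLATES = ["food. the answer is always {thing}."]
PIRATE_TEMPLATES = ["arr, the answer be {thing}, matey."]

def generate(templates, things, rng):
    """Fill a random template -- identical code for every persona."""
    return rng.choice(templates).format(thing=rng.choice(things))

rng = random.Random(7)
print(generate(FISH_TEMPLATES, ["food", "bubbles"], rng))
print(generate(PIRATE_TEMPLATES, ["treasure", "grog"], rng))
```

Regenerate the dataset, rerun the unchanged training cells, and the same 8.7M weights come out speaking pirate.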

This teaches a truth about LLMs that’s hard to internalize from theory: the model is the data. The transformer is just the learning machinery. What it becomes depends entirely on what you feed it.

Run It Right Now

In your browser (no install): arman-bd.github.io/guppylm — runs a quantized ONNX model (~10MB) via WebAssembly.

Chat in Colab (pre-trained model): use_guppylm.ipynb

Train from scratch (the real learning): train_guppylm.ipynb — set runtime to T4 GPU, run all cells. 5 minutes.

Local chat:

pip install torch tokenizers
python -m guppylm chat

Design Decisions Worth Studying

The README includes a “Design Decisions” section that’s worth reading carefully:

Why no system prompt? A 9M model can’t conditionally follow instructions. The personality must be baked in. This explains why fine-tuning exists — and why it works differently from prompting.

Why single-turn only? Multi-turn degrades at turn 3-4 with a 128-token context window. This illustrates why context length matters and why companies spend billions extending it.

Why vanilla transformer? Modern optimizations (GQA, SwiGLU, RoPE) don’t help at 9M parameters. They exist to scale efficiently — learning what they solve requires seeing the system without them first.

Why synthetic data? Consistency. A character needs consistent personality across examples. This is the same principle behind RLHF and constitutional AI — shaping behavior through carefully curated data.

The Numbers

  • 9M parameters (vs. GPT-4’s ~1.8T)
  • 5 minutes training time on free T4
  • 60K synthetic training conversations
  • 10MB quantized ONNX for browser inference
  • MIT license
  • GitHub | HuggingFace model | HuggingFace dataset