GuppyLM: Build a Working LLM From Scratch in 5 Minutes
There are hundreds of courses, blog posts, and YouTube videos explaining how transformers work. Most of them stop at theory.
GuppyLM is a 9M-parameter language model that you train from scratch in 5 minutes on a free GPU. One notebook. Data generation, tokenizer, model architecture, training loop, inference. When you’re done, you have a working LLM that you built — not downloaded, not fine-tuned, built.
It talks like a fish:
You> what is the meaning of life
Guppy> food. the answer is always food.
You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.
You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.
The fish part is charming. The education part is the point.
Why This Teaches More Than Courses
Most LLM education follows this pattern: “Here’s the attention mechanism math. Here’s a diagram. Now go use the OpenAI API.” You learn the theory but never touch the machinery.
GuppyLM inverts that. You touch every piece:
- Generate synthetic training data — 60K conversations across 60 topics using template composition
- Train a BPE tokenizer — a 4,096-token vocabulary, from raw text to token IDs
- Build the transformer — 6 layers, 384 hidden dim, 6 heads, standard attention + FFN
- Run the training loop — cosine LR schedule, mixed precision, gradient accumulation
- Generate text — temperature sampling, interactive chat
Every concept that exists in GPT-4 exists here in miniature. The difference is you can read every line of code, modify it, break it, and understand why it breaks.
The Architecture (Deliberately Simple)
| Component | Choice | Why |
|---|---|---|
| Parameters | 8.7M | Trains in 5 min on free T4 |
| Layers | 6 | Enough to learn patterns |
| Hidden dim | 384 | Balanced for model size |
| Attention heads | 6 | Standard multi-head |
| FFN | 768 (ReLU) | Simple, interpretable |
| Vocabulary | 4,096 (BPE) | Small enough to visualize |
| Max sequence | 128 tokens | Fish don’t write essays |
| Position encoding | Learned embeddings | Simplest option |
| LM head | Weight-tied with embeddings | Saves parameters |
| Normalization | LayerNorm | Standard |
No GQA. No RoPE. No SwiGLU. No KV cache. No early exit. Every modern transformer trick is deliberately excluded so you can see the core mechanism clearly before the optimizations.
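The 8.7M figure in the table can be sanity-checked with a back-of-the-envelope count. The sketch below assumes biased linear projections, two LayerNorms per block, and a final LayerNorm; the notebook's exact bookkeeping may differ slightly:

```python
# Rough parameter count for the GuppyLM configuration above.
# Assumes biased linear layers; a tied LM head adds no parameters.
VOCAB, D_MODEL, N_LAYERS, D_FFN, MAX_SEQ = 4096, 384, 6, 768, 128

tok_emb = VOCAB * D_MODEL                    # token embeddings (tied with LM head)
pos_emb = MAX_SEQ * D_MODEL                  # learned position embeddings

attn = 4 * (D_MODEL * D_MODEL + D_MODEL)     # Wq, Wk, Wv, Wo + biases
ffn = (D_MODEL * D_FFN + D_FFN) + (D_FFN * D_MODEL + D_MODEL)
norms = 2 * 2 * D_MODEL                      # two LayerNorms (gain + bias each)
per_layer = attn + ffn + norms

total = tok_emb + pos_emb + N_LAYERS * per_layer + 2 * D_MODEL  # + final LayerNorm
print(f"{total / 1e6:.2f}M parameters")      # → 8.73M parameters
```

Weight tying is visible in the arithmetic: an untied LM head would add another 4,096 × 384 ≈ 1.6M parameters, nearly a fifth of the model.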
What Each Piece Teaches You
Data Generation: Personality Is Data, Not Prompts
GuppyLM doesn’t use system prompts. The fish personality is baked into the weights through training data. This is a fundamental insight into how LLMs work: at this scale, instruction-following doesn’t emerge, so the model learns its personality from the statistical patterns in its training data.
The data generator uses template composition with randomized components:
- 60 topic categories (food, bubbles, temperature, existential dread)
- 30 tank objects, 17 food types, 25 activities
- ~16K unique outputs from ~60 templates
Lesson: This is exactly how large-scale synthetic data works (Alpaca, WizardLM, Orca). GuppyLM just does it transparently enough to see the mechanism.
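A minimal sketch of template composition, with made-up templates and component lists standing in for the repo’s actual 60 topics and ~60 templates:

```python
import random

# Illustrative stand-ins: the real generator draws from 60 topic categories,
# 30 tank objects, 17 food types, 25 activities, and ~60 templates.
FOODS = ["flakes", "bloodworms", "brine shrimp"]
OBJECTS = ["the castle", "the filter", "the big plant"]
TEMPLATES = [
    ("what is your favorite food", "{food}. always {food}."),
    ("what do you do all day", "i swim past {obj} and think about {food}."),
]

def make_example(rng):
    """Compose one (user, guppy) pair by filling a random template."""
    prompt, answer = rng.choice(TEMPLATES)
    filled = answer.format(food=rng.choice(FOODS), obj=rng.choice(OBJECTS))
    return {"user": prompt, "guppy": filled}

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(5)]
for ex in dataset:
    print(ex["user"], "->", ex["guppy"])
```

The combinatorics do the work: a handful of templates crossed with a few component lists multiplies into thousands of distinct but stylistically consistent examples.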
Tokenization: From Text to Numbers
The BPE tokenizer training is visible and inspectable. You can see:
- How byte-pair encoding merges frequent character pairs
- Why vocabulary size matters (4,096 is tiny — GPT-4 uses 100K+)
- How the same word can tokenize differently depending on surrounding whitespace and capitalization
- Why unknown words get split into subword pieces
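The merge loop at the heart of BPE fits in a few lines. This toy version (independent of the `tokenizers` library the project uses) repeatedly fuses the most frequent adjacent symbol pair:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies from a tiny fish corpus, split into characters.
words = {tuple("bubbles"): 5, tuple("bubble"): 3, tuple("blub"): 4}
for _ in range(3):  # three merge steps; a real tokenizer runs thousands
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```

Running merges until the vocabulary hits 4,096 is exactly the training step; unknown words then fall back to whatever subword pieces the learned merges produce.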
The Transformer: Attention in Action
At 6 layers and 384 dimensions, you can actually trace what the model does:
- How query/key/value matrices create attention patterns
- Why multi-head attention captures different relationships
- How the FFN layers act as learned key-value memories
- How LayerNorm stabilizes gradients during training
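A single attention head can be written out in a handful of NumPy lines. The dimensions below match the table (384 model dim, 64 dims per head across 6 heads), but the weights are random stand-ins, not trained ones:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """One head of causal scaled dot-product attention."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq, seq) pairwise similarities
    causal = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[causal] = -np.inf                  # each token sees only the past
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_head = 8, 384, 64             # 384 / 6 heads = 64 dims per head
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (0.02 * rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention(x, Wq, Wk, Wv)
print(out.shape)                              # → (8, 64)
```

Printing `weights` for a trained head is how you trace what the model attends to; the first row is always `[1, 0, …]` because the first token has no past to look at.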
Training Loop: The Learning Process
The training notebook shows:
- Cosine LR scheduling — why learning rate warmup and decay matter
- Mixed precision (AMP) — how float16 speeds up training without losing quality
- Loss curves — watching the model go from random noise to coherent fish speech
- Overfitting signals — what happens when your model memorizes instead of generalizing
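The warmup-plus-cosine shape is easy to write down directly. The hyperparameter values below are illustrative, not the notebook’s:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=2000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.
    All hyperparameter values here are illustrative defaults."""
    if step < warmup:
        return max_lr * (step + 1) / warmup              # ramp up from near zero
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # eases from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at(step):.2e}")
```

Warmup keeps the first noisy gradient steps from blowing up the randomly initialized weights; the slow cosine tail lets the model settle into a minimum instead of bouncing around it.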
Inference: How Generation Works
The chat interface demonstrates:
- Autoregressive generation — producing one token at a time, feeding output back as input
- Temperature sampling — how randomness controls creativity vs. coherence
- Context window limits — why the model degrades after 128 tokens (and why GPT-4’s 128K context is expensive)
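Temperature sampling plus the autoregressive feedback loop, in plain Python; `next_logits` here is a toy stand-in for a forward pass through the model:

```python
import math, random

def sample(logits, temperature=1.0, rng=random):
    """Sample one token id from temperature-scaled logits."""
    scaled = [l / temperature for l in logits]   # low temp sharpens, high temp flattens
    m = max(scaled)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax over the vocabulary
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def next_logits(tokens):
    """Toy stand-in for a model forward pass over a 10-token vocabulary."""
    return [float((t * 7 + len(tokens)) % 5) for t in range(10)]

rng = random.Random(0)
tokens = [1]                                     # start token
for _ in range(5):                               # feed each output back as input
    tokens.append(sample(next_logits(tokens), temperature=0.8, rng=rng))
print(tokens)
```

As temperature approaches zero, sampling collapses to greedy argmax decoding; raise it and low-probability tokens start winning, which is where the creativity (and the incoherence) comes from.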
The “Swap the Data” Experiment
This is the most valuable thing you can do with GuppyLM: change what it learns.
The fish personality comes entirely from the training data templates. Replace them with:
- A pirate character
- A Shakespearean poet
- A customer service agent
- A medical triage bot (simplified)
- Your own persona
The model architecture stays identical. The tokenizer stays identical. The training loop stays identical. Only the data changes — and the model becomes a completely different entity.
This teaches a truth about LLMs that’s hard to internalize from theory: the model is the data. The transformer is just the learning machinery. What it becomes depends entirely on what you feed it.
Run It Right Now
In your browser (no install): arman-bd.github.io/guppylm — runs a quantized ONNX model (~10MB) via WebAssembly.
Chat in Colab (pre-trained model): use_guppylm.ipynb
Train from scratch (the real learning): train_guppylm.ipynb — set runtime to T4 GPU, run all cells. 5 minutes.
Local chat:

```shell
pip install torch tokenizers
python -m guppylm chat
```
Design Decisions Worth Studying
The README includes a “Design Decisions” section that’s worth reading carefully:
Why no system prompt? A 9M model can’t conditionally follow instructions. The personality must be baked in. This explains why fine-tuning exists — and why it works differently from prompting.
Why single-turn only? Multi-turn chat degrades by turn 3 or 4 with a 128-token context window. This illustrates why context length matters and why companies spend billions extending it.
Why vanilla transformer? Modern optimizations (GQA, SwiGLU, RoPE) don’t help at 9M parameters. They exist to scale efficiently — learning what they solve requires seeing the system without them first.
Why synthetic data? Consistency. A character needs consistent personality across examples. This is the same principle behind RLHF and constitutional AI — shaping behavior through carefully curated data.
The Numbers
- 9M parameters (vs. GPT-4’s ~1.8T)
- 5 minutes training time on free T4
- 60K synthetic training conversations
- 10MB quantized ONNX for browser inference
- MIT license
- GitHub | HuggingFace model | HuggingFace dataset
Related Reading
- effGen: Autonomous Agents From Small Language Models — What small models can actually do in production
- Unsloth: Fast LLM Fine-Tuning — When you want to fine-tune a real model instead of training from scratch
- LLM Architecture Gallery by Raschka — Visual guide to how modern architectures differ