Every Major LLM Architecture in One Place -- Sebastian Raschka's Gallery Explained
Seven years after GPT-2, the transformer is still the backbone of every major language model. What has changed is the pile of engineering refinements layered on top of that backbone — and keeping track of which model uses which trick is genuinely hard.
Sebastian Raschka, author of Build an LLM from Scratch and one of the most thorough ML educators working today, built a solution: the LLM Architecture Gallery. Updated as recently as March 16, 2026, it’s a single-page visual reference covering every major open-weight model with consistent architecture diagrams and fact sheets.
This post walks you through how to use it and what the key architectural decisions actually mean.
What the Gallery Covers
Each entry in the gallery shows:
- Scale (parameter count)
- Date released
- Decoder type (dense vs MoE)
- Attention mechanism
- Key architectural detail (what makes this model different)
Models covered include GPT-2, OLMo 2, Llama 3.1/3.2/4, DeepSeek V2/V3, Gemma 2/3, Mistral, Mixtral, Phi-4, Qwen 2.5, Falcon, Nemotron, Command R, and more — essentially the full landscape of open-weight LLMs worth knowing about.
The gallery is derived from Raschka’s longer article The Big LLM Architecture Comparison, but distilled into visual fact sheets you can actually scan quickly.
The Four Architectural Decisions That Actually Matter
Reading through the gallery, four choices show up repeatedly. Understanding them lets you decode any new model announcement in under a minute.
1. Attention: MHA → GQA → MLA
Multi-Head Attention (MHA) — the original. Every query head has its own K/V heads. Accurate but memory-hungry because the KV cache grows with every head.
Grouped-Query Attention (GQA) — now the standard. Groups of query heads share a single K/V head, so the KV cache shrinks in proportion to the number of K/V heads, with minimal quality loss. Llama 3, Mistral, Gemma 2, and most 2024-2025 models use it.
Multi-Head Latent Attention (MLA) — introduced with DeepSeek V2 and carried into V3. Compresses K/V into a low-rank latent vector that is cached and projected back up at attention time. Even more memory-efficient than GQA, and a key reason DeepSeek V3 can serve at scale more cheaply than equivalently sized models.
The gallery makes this progression visually obvious — you can literally see the K/V head count shrink as you scan through models by date.
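To make the K/V sharing concrete, here is a minimal NumPy sketch of grouped-query attention. It is a toy, not any model's actual code: tiny shapes, no causal mask, no output projection, and all names are illustrative. The point is the `h // group` line, where several query heads read from the same K/V head.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Toy GQA: n_q_heads query heads share n_kv_heads K/V heads
    (n_q_heads must be a multiple of n_kv_heads). With n_kv_heads ==
    n_q_heads this reduces to plain MHA; fewer K/V heads means a
    proportionally smaller KV cache."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads              # query heads per K/V head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # only these two tensors
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)  # would live in the KV cache

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                           # which shared K/V head to use
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)
```

Serving four query heads against two K/V heads halves the cached K/V tensors relative to MHA; that ratio is exactly what shrinks as you scan the gallery by date.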
2. Positional Encoding: Absolute → RoPE
Early models (GPT-2, early BERT variants) used learned absolute positional embeddings — a lookup table mapping position → vector. The problem: they don’t generalize well to sequence lengths not seen during training.
Rotary Position Embedding (RoPE), introduced in 2021, encodes position as a rotation applied to Q and K vectors during attention. It handles longer contexts better and has essentially replaced absolute embeddings in every model released after 2022. Every Llama variant, every Mistral, every DeepSeek — all RoPE.
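A minimal sketch of the rotation itself, assuming the standard pairing of consecutive dimensions (toy code, not any library's implementation):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even.
    Each consecutive pair of dimensions is rotated by an angle that
    grows with position; lower pairs rotate fast, higher pairs slowly."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                  # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = pos * freqs                           # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split into dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because a rotation preserves vector norms, RoPE leaves the magnitude of Q and K untouched, and the dot product between a rotated query and a rotated key depends only on their relative offset — the property that lets RoPE extrapolate better than absolute position tables.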
3. Activation Function: GELU → SwiGLU
Not glamorous, but consistent. SwiGLU (Swish + Gated Linear Unit) outperforms GELU at the same compute budget across virtually every benchmark it’s been tested on. Noam Shazeer proposed it in 2020; Llama adopted it; now it’s the default everywhere.
The gallery shows exactly which models switched and when. GELU now mostly appears in older or specialized architectures.
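The whole block fits in a few lines. A bias-free sketch in the style Llama popularized (illustrative weight names, not a real model's parameters):

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x), written to avoid overflow for large -x
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Toy SwiGLU feed-forward block: the up-projection is gated,
    element-wise, by a Swish of a parallel projection of the same input.
    No bias terms, matching modern LLM practice."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down
```

Compared with a plain GELU FFN, the gate adds a third weight matrix, which is why SwiGLU models typically shrink the hidden dimension to keep the parameter count comparable.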
4. Dense vs Mixture-of-Experts (MoE)
A dense model activates all parameters for every token. An MoE model replaces the single FFN in each block with multiple “expert” FFNs and routes each token to only a small subset of them.
DeepSeek V3: 671B total parameters, 37B activated per token. Mixtral 8x7B: 47B total, 12.9B activated per token. The result is large model capacity at a fraction of the inference cost.
The tradeoff: MoE models are harder to train (load balancing across experts is non-trivial), and they require more total memory even if per-token compute is lower.
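A toy top-k router makes the mechanism concrete (a sketch only — real MoE layers add load-balancing losses, batched expert dispatch, and shared experts in some designs):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy top-k MoE routing: for each token, pick the k experts with the
    highest router score, run only those FFNs, and mix their outputs by
    softmax weight. `experts` is a list of (w1, w2) ReLU-FFN weight pairs."""
    seq, d = x.shape
    logits = x @ router_w                        # (seq, n_experts) router scores
    out = np.zeros_like(x)
    for t in range(seq):
        top = np.argsort(logits[t])[-k:]         # indices of the top-k experts
        weights = np.exp(logits[t][top] - logits[t][top].max())
        weights /= weights.sum()                 # softmax over the chosen k
        for w, e in zip(weights, top):
            w1, w2 = experts[e]
            out[t] += w * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out
```

Note what the loop implies: per-token compute touches only k experts, but every expert's weights must stay resident in memory, since any token might be routed to any of them. That is the memory/compute tradeoff in one line.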
How to Actually Use the Gallery
If you’re evaluating a model for deployment: Filter by scale (parameter count) and check the decoder type. MoE models need more VRAM than their “active parameter” count suggests.
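A back-of-the-envelope check using the Mixtral numbers above, assuming fp16/bf16 weights at 2 bytes per parameter and ignoring KV cache and activation overheads:

```python
# Weight memory for an MoE model is set by TOTAL parameters,
# not the per-token "active" count.
total_params = 47e9      # Mixtral 8x7B, total
active_params = 12.9e9   # activated per token
bytes_per_param = 2      # fp16 / bf16

weights_gb = total_params * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9
print(f"weights in memory: ~{weights_gb:.0f} GB, "
      f"vs ~{active_gb:.0f} GB if you sized by active params")
# → weights in memory: ~94 GB, vs ~26 GB if you sized by active params
```

Sizing hardware by the 12.9B active-parameter figure would undershoot by roughly 3.6x.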
If you’re reading a new model paper: Find its nearest neighbor in the gallery by architecture type and date. The “key detail” field tells you what the authors thought was novel about their design choices.
If you’re training from scratch: The progression from GPT-2 → modern Llama stack (RoPE + GQA + SwiGLU + RMSNorm + no bias terms) is the baseline recipe. Every deviation from this should have a reason.
If you want it on your wall: Raschka sells the gallery as a physical poster on Redbubble and Zazzle. The source file is a 182-megapixel PNG — it’ll look sharp at any poster size.
The Bigger Picture
What Raschka’s article observes — and the gallery makes viscerally clear — is that the transformer architecture hasn’t fundamentally changed in seven years. GPT-2 and Llama 4 are recognizably the same family of models. What has changed is the accumulation of engineering refinements: RoPE, GQA, SwiGLU, RMSNorm, no-bias layers, MoE routing.
Each of those choices improved things by a few percent. Stacked together across hundreds of models and billions of training tokens, they add up to the current state of the art.
The gallery is the clearest single-page demonstration of that compounding that exists. Worth bookmarking, worth printing.
Gallery: sebastianraschka.com/llm-architecture-gallery
Full article: The Big LLM Architecture Comparison — Sebastian Raschka
Book: Build an LLM from Scratch — the definitive hands-on guide