Every Major LLM Architecture in One Place -- Sebastian Raschka's Gallery Explained
Seven years after GPT-2, the transformer is still the backbone of every major language model. What has changed is the pile of engineering refinements layered on top of that backbone — and keeping track of which model uses which trick is genuinely hard.
Sebastian Raschka, author of Build an LLM from Scratch and one of the most thorough ML educators working today, built a solution: the LLM Architecture Gallery. Updated as recently as March 16, 2026, it’s a single-page visual reference covering every major open-weight model with consistent architecture diagrams and fact sheets.
This post walks you through how to use it and what the key architectural decisions actually mean.
What the Gallery Covers
Each entry in the gallery shows:
- Scale (parameter count)
- Date released
- Decoder type (dense vs MoE)
- Attention mechanism
- Key architectural detail (what makes this model different)
Models covered include GPT-2, OLMo 2, Llama 3.1/3.2/4, DeepSeek V2/V3, Gemma 2/3, Mistral, Mixtral, Phi-4, Qwen 2.5, Falcon, Nemotron, Command R, and more — essentially the full landscape of open-weight LLMs worth knowing about.
The gallery is derived from Raschka’s longer article The Big LLM Architecture Comparison, but distilled into visual fact sheets you can actually scan quickly.
The Four Architectural Decisions That Actually Matter
Reading through the gallery, four choices show up repeatedly. Understanding them lets you decode any new model announcement in under a minute.
1. Attention: MHA → GQA → MLA
Multi-Head Attention (MHA) — the original. Every query head has its own K/V heads. Accurate but memory-hungry because the KV cache grows with every head.
Grouped-Query Attention (GQA) — now the standard. Groups of query heads share a single K/V head, so the KV cache shrinks in proportion to the number of K/V heads, with minimal quality loss. Llama 3, Mistral, Gemma 2, and most 2024-2025 models use it.
Multi-Head Latent Attention (MLA) — introduced with DeepSeek V2 and carried into V3. Compresses K/V into a low-rank latent vector that is cached and projected back up at attention time. Even more memory-efficient than GQA, and a key reason DeepSeek V3 can serve at scale more cheaply than equivalently sized models.
The gallery makes this progression visually obvious — you can literally see the K/V head count shrink as you scan through models by date.
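To make the K/V sharing concrete, here is a minimal NumPy sketch of grouped-query attention. It is a toy, not any model's actual code: tiny shapes, no causal mask, no output projection, and all names are illustrative. The point is the `h // group` line, where several query heads read from the same K/V head.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Toy GQA: n_q_heads query heads share n_kv_heads K/V heads
    (n_q_heads must be a multiple of n_kv_heads). With n_kv_heads ==
    n_q_heads this reduces to plain MHA; fewer K/V heads means a
    proportionally smaller KV cache."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads              # query heads per K/V head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # only these two tensors
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)  # would live in the KV cache

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                           # which shared K/V head to use
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)
```

Serving four query heads against two K/V heads halves the cached K/V tensors relative to MHA; that ratio is exactly what shrinks as you scan the gallery by date.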
2. Positional Encoding: Absolute → RoPE
Early models (GPT-2, early BERT variants) used learned absolute positional embeddings — a lookup table mapping position → vector. The problem: they don’t generalize well to sequence lengths not seen during training.
Rotary Position Embedding (RoPE), introduced in 2021, encodes position as a rotation applied to Q and K vectors during attention. It handles longer contexts better and has essentially replaced absolute embeddings in every model released after 2022. Every Llama variant, every Mistral, every DeepSeek — all RoPE.
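A minimal sketch of the rotation itself, assuming the standard pairing of consecutive dimensions (toy code, not any library's implementation):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even.
    Each consecutive pair of dimensions is rotated by an angle that
    grows with position; lower pairs rotate fast, higher pairs slowly."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                  # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = pos * freqs                           # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split into dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because a rotation preserves vector norms, RoPE leaves the magnitude of Q and K untouched, and the dot product between a rotated query and a rotated key depends only on their relative offset — the property that lets RoPE extrapolate better than absolute position tables.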
3. Activation Function: GELU → SwiGLU
Not glamorous, but consistent. SwiGLU (Swish + Gated Linear Unit) outperforms GELU at the same compute budget across virtually every benchmark it’s been tested on. Noam Shazeer proposed it in 2020; Llama adopted it; now it’s the default everywhere.
The gallery shows exactly which models switched and when. GELU now mostly appears in older or specialized architectures.
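The whole block fits in a few lines. A bias-free sketch in the style Llama popularized (illustrative weight names, not a real model's parameters):

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x), written to avoid overflow for large -x
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Toy SwiGLU feed-forward block: the up-projection is gated,
    element-wise, by a Swish of a parallel projection of the same input.
    No bias terms, matching modern LLM practice."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down
```

Compared with a plain GELU FFN, the gate adds a third weight matrix, which is why SwiGLU models typically shrink the hidden dimension to keep the parameter count comparable.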
4. Dense vs Mixture-of-Experts (MoE)
A dense model activates all parameters for every token. An MoE model replaces the single FFN in each block with multiple “expert” FFNs and routes each token to only a small subset of them.
DeepSeek V3: 671B total parameters, 37B activated per token. Mixtral 8x7B: 47B total, 12.9B activated per token. The result is large model capacity at a fraction of the inference cost.
The tradeoff: MoE models are harder to train (load balancing across experts is non-trivial), and they require more total memory even if per-token compute is lower.
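A toy top-k router makes the mechanism concrete (a sketch only — real MoE layers add load-balancing losses, batched expert dispatch, and shared experts in some designs):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy top-k MoE routing: for each token, pick the k experts with the
    highest router score, run only those FFNs, and mix their outputs by
    softmax weight. `experts` is a list of (w1, w2) ReLU-FFN weight pairs."""
    seq, d = x.shape
    logits = x @ router_w                        # (seq, n_experts) router scores
    out = np.zeros_like(x)
    for t in range(seq):
        top = np.argsort(logits[t])[-k:]         # indices of the top-k experts
        weights = np.exp(logits[t][top] - logits[t][top].max())
        weights /= weights.sum()                 # softmax over the chosen k
        for w, e in zip(weights, top):
            w1, w2 = experts[e]
            out[t] += w * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out
```

Note what the loop implies: per-token compute touches only k experts, but every expert's weights must stay resident in memory, since any token might be routed to any of them. That is the memory/compute tradeoff in one line.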
How to Actually Use the Gallery
If you’re evaluating a model for deployment: Filter by scale (parameter count) and check the decoder type. MoE models need more VRAM than their “active parameter” count suggests.
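A back-of-the-envelope check using the Mixtral numbers above, assuming fp16/bf16 weights at 2 bytes per parameter and ignoring KV cache and activation overheads:

```python
# Weight memory for an MoE model is set by TOTAL parameters,
# not the per-token "active" count.
total_params = 47e9      # Mixtral 8x7B, total
active_params = 12.9e9   # activated per token
bytes_per_param = 2      # fp16 / bf16

weights_gb = total_params * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9
print(f"weights in memory: ~{weights_gb:.0f} GB, "
      f"vs ~{active_gb:.0f} GB if you sized by active params")
# → weights in memory: ~94 GB, vs ~26 GB if you sized by active params
```

Sizing hardware by the 12.9B active-parameter figure would undershoot by roughly 3.6x.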
If you’re reading a new model paper: Find its nearest neighbor in the gallery by architecture type and date. The “key detail” field tells you what the authors thought was novel about their design choices.
If you’re training from scratch: The progression from GPT-2 → modern Llama stack (RoPE + GQA + SwiGLU + RMSNorm + no bias terms) is the baseline recipe. Every deviation from this should have a reason.
If you want it on your wall: Raschka sells the gallery as a physical poster on Redbubble and Zazzle. The source file is a 182-megapixel PNG — it’ll look sharp at any poster size.
The Bigger Picture
What Raschka’s article observes — and the gallery makes viscerally clear — is that the transformer architecture hasn’t fundamentally changed in seven years. GPT-2 and Llama 4 are recognizably the same family of models. What has changed is the accumulation of engineering refinements: RoPE, GQA, SwiGLU, RMSNorm, no-bias layers, MoE routing.
Each of those choices improved things by a few percent. Stacked together across hundreds of models and billions of training tokens, they add up to the current state of the art.
The gallery is the clearest single-page demonstration of that compounding that exists. Worth bookmarking, worth printing.
Gallery: sebastianraschka.com/llm-architecture-gallery
Full article: The Big LLM Architecture Comparison — Sebastian Raschka
Book: Build an LLM from Scratch — the definitive hands-on guide