Where Does Medical Knowledge Live Inside LLMs? Knowledge Maps Tell You
When an LLM answers a medical question correctly, where did that answer come from? Not which training document — which layers, which weights, which part of the network’s internal geometry?
A team from Lumos AI decided to find out. They systematically mapped where medical knowledge is stored inside five open-source LLMs — Llama 3.3-70B, Gemma 3-27B, MedGemma-27B, Qwen-32B, and GPT-OSS-120B — using four independent interpretability methods. The result: LLM knowledge maps that show, layer by layer, where different types of medical information are encoded.
The Four Lenses
The researchers didn’t rely on a single technique. They used four complementary methods to triangulate where knowledge lives:
- UMAP projections of intermediate activations — visualize how the model clusters medical concepts at each layer
- Gradient-based saliency — which weights contribute most to medical outputs
- Layer lesioning — remove a layer entirely, measure how much medical performance degrades
- Activation patching — replace one layer’s output with activations from a different prompt, see what breaks (a minimal sketch of this follows the list)
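To make the last lens concrete, here is a minimal sketch of activation patching with PyTorch forward hooks. This is not the team’s code: gpt2 stands in for the paper’s much larger models, and the prompts, layer index, and patched position are illustrative assumptions.

```python
# Minimal sketch of activation patching via PyTorch forward hooks.
# gpt2 is a stand-in; the same hook logic applies to any HF transformer
# whose blocks return a (hidden_states, ...) tuple.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("Aspirin is used to treat", return_tensors="pt")
corrupt = tok("Insulin is used to treat", return_tensors="pt")

LAYER = 6  # which block to patch (arbitrary choice for the sketch)
cache = {}

def save_hook(module, args, output):
    # Record this block's hidden states from the clean run.
    cache["h"] = output[0].detach()

def patch_hook(module, args, output):
    # Splice the clean run's last-token state into the corrupted run.
    patched = output[0].clone()
    patched[:, -1] = cache["h"][:, -1]
    return (patched,) + output[1:]

block = model.transformer.h[LAYER]

with torch.no_grad():
    handle = block.register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    logits = model(**corrupt).logits[0, -1]
    handle.remove()

print("top next token after patching:", tok.decode(logits.argmax()))
```

The same hook machinery, pointed at different blocks and token positions, is how you sweep an entire model: if patching layer i transfers the clean answer, layer i carries the relevant information.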
Using multiple methods on the same models means the findings aren’t artifacts of one technique. When all four methods point to the same layers, you can trust the result.
The Findings That Matter
Medical knowledge concentrates early
In Llama 3.3-70B, most medical knowledge is processed in the first half of the model’s layers. This challenges the assumption that deeper layers handle more complex reasoning. For medical tasks at least, the heavy lifting happens earlier than you’d expect.
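Layer lesioning is the most direct way to see this concentration. Here is a rough sketch of such a sweep; gpt2 again stands in for the 70B model, and KL divergence from the intact model substitutes for the paper’s medical benchmarks.

```python
# Sketch of a layer-lesioning sweep: make each block an identity map in
# turn and measure how far the next-token distribution drifts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lesion(block):
    """Replace a block's output hidden states with its input hidden
    states, turning the block into an identity map."""
    def hook(module, args, output):
        hidden_in = args[0]
        if isinstance(output, tuple):
            return (hidden_in,) + output[1:]
        return hidden_in
    return block.register_forward_hook(hook)

prompt = "Metformin is a first-line treatment for"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    base_logits = model(**inputs).logits[0, -1]

for i, block in enumerate(model.transformer.h):
    handle = lesion(block)
    with torch.no_grad():
        lesioned_logits = model(**inputs).logits[0, -1]
    handle.remove()
    # KL(intact || lesioned): how much the lesion distorts the output.
    kl = torch.nn.functional.kl_div(
        lesioned_logits.log_softmax(-1),
        base_logits.softmax(-1),
        reduction="sum",
    )
    print(f"layer {i:2d}: KL from intact model = {kl:.3f}")
```

In the paper’s framing, “knowledge concentrates early” means the big degradation spikes cluster in the first half of such a sweep.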
Age isn’t a number line
How does a model represent “the patient is 45 years old”? You might assume it’s a linear encoding — a smooth gradient from young to old. It’s not. Age is encoded non-linearly, with sharp discontinuities. The boundary between under-18 and over-18 shows up as a clear break in the representation — the model has internalized a categorical distinction that mirrors legal and clinical significance.
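A finding like this is cheap to test on any open model. The sketch below is a hypothetical probe, not the paper’s method: it embeds ages around the legal boundary and looks for a jump in the distance between consecutive ages’ activations. The prompt template and layer index are assumptions.

```python
# Hypothetical probe for the age discontinuity: a categorical break
# should show up as a spike at the 17 -> 18 step.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary layer for the sketch

acts = []
with torch.no_grad():
    for age in range(10, 26):
        inputs = tok(f"The patient is {age} years old.", return_tensors="pt")
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        acts.append(hidden[LAYER][0, -1])  # last-token activation

for age, (a, b) in zip(range(10, 25), zip(acts, acts[1:])):
    print(f"{age} -> {age + 1}: activation distance {torch.dist(a, b).item():.2f}")
```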
Disease progression is circular
This is the strangest finding. You’d expect disease progression to be represented as a sequence: early → moderate → severe. Instead, at certain layers, progression follows a circular pattern in activation space. The representation wraps around, which could reflect how diseases cycle through remission and relapse, or how the model learned from clinical narratives that don’t always follow linear trajectories.
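You can look for this geometry yourself with nothing fancier than PCA. The sketch below is a hedged illustration: the stage phrases, prompt template, and layer are placeholder assumptions, and a real analysis would use many more prompts per stage.

```python
# Hypothetical check for circular structure: project stage activations
# to 2D and inspect the angle each stage makes around the centroid. A
# monotone sweep of angles that wraps back toward the start is the
# circular signature; a straight segment is not.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()
STAGES = ["early stage", "moderate stage", "severe stage",
          "remission", "relapse"]
LAYER = 6

acts = []
with torch.no_grad():
    for stage in STAGES:
        inputs = tok(f"The disease is in {stage}.", return_tensors="pt")
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        acts.append(hidden[LAYER][0, -1].numpy())

xy = PCA(n_components=2).fit_transform(np.stack(acts))
xy -= xy.mean(axis=0)  # measure angles around the centroid
for stage, (x, y) in zip(STAGES, xy):
    print(f"{stage:15s} angle = {np.degrees(np.arctan2(y, x)):7.1f} deg")
```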
Drugs cluster by specialty, not mechanism
In Llama 3.3-70B, drugs organize themselves by medical specialty (cardiology drugs together, oncology drugs together) rather than by pharmacological mechanism of action. This is exactly how clinicians think about medications — by the condition they treat, not by the receptor they bind — and suggests the model learned clinical practice patterns rather than pharmacology textbook structure.
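This is exactly the kind of structure the UMAP lens surfaces. A toy version follows; the drug list, specialty labels, prompt, and layer are all illustrative stand-ins for the paper’s setup.

```python
# Toy UMAP projection of drug-name activations. With real models and
# hundreds of drugs, you would color points by specialty and by
# mechanism and see which labeling yields clean clusters.
import numpy as np
import torch
import umap  # pip install umap-learn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()
DRUGS = {"atenolol": "cardiology", "amlodipine": "cardiology",
         "lisinopril": "cardiology", "atorvastatin": "cardiology",
         "cisplatin": "oncology", "paclitaxel": "oncology",
         "rituximab": "oncology", "tamoxifen": "oncology"}
LAYER = 6

acts = []
with torch.no_grad():
    for drug in DRUGS:
        inputs = tok(f"The drug {drug}", return_tensors="pt")
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        acts.append(hidden[LAYER][0, -1].numpy())

xy = umap.UMAP(n_neighbors=4, random_state=0).fit_transform(np.stack(acts))
for (drug, specialty), point in zip(DRUGS.items(), xy):
    print(f"{drug:12s} ({specialty}): {point}")
```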
Activation collapse and recovery
Gemma 3-27B and MedGemma-27B show a phenomenon where activations collapse at intermediate layers — losing discriminative structure — then recover by the final layers. This is architecturally concerning. It means there are layers in the network that contribute nothing meaningful to medical reasoning, or worse, that the model is working around its own internal bottleneck.
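Collapse is measurable with a simple metric. One plausible choice, assuming “discriminative structure” means distinct concepts staying far apart: mean pairwise cosine distance between concept activations at each layer. A dip at intermediate layers followed by a rebound is the collapse-and-recovery signature. The concept list below is a placeholder.

```python
# Per-layer discriminability: mean pairwise cosine distance between
# the activations of distinct medical concepts.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()
CONCEPTS = ["pneumonia", "diabetes", "melanoma", "arrhythmia"]

reps = []
with torch.no_grad():
    for concept in CONCEPTS:
        inputs = tok(concept, return_tensors="pt")
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        reps.append(torch.stack([h[0, -1] for h in hidden]))  # (layers, d)
reps = torch.stack(reps)  # (concepts, layers, d)

n = len(CONCEPTS)
off_diag = ~torch.eye(n, dtype=torch.bool)
for layer in range(reps.shape[1]):
    x = torch.nn.functional.normalize(reps[:, layer], dim=-1)
    dist = 1 - (x @ x.T)[off_diag].mean()
    print(f"layer {layer:2d}: mean pairwise cosine distance {dist:.3f}")
```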
Why This Matters Beyond Research
The practical implication is concrete: if you know which layers store which knowledge, you can target interventions.
Fine-tuning. Instead of fine-tuning the entire model, target the layers where medical knowledge concentrates. This is cheaper and less likely to damage other capabilities.
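Concretely, that can be as simple as freezing everything outside the mapped layers. A sketch, assuming a hypothetical knowledge map that points at blocks 4-11 of a GPT-2-style model:

```python
# Layer-targeted fine-tuning: freeze all parameters, then unfreeze only
# the blocks the knowledge map flagged, and train as usual.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
TARGET_LAYERS = range(4, 12)  # hypothetical output of a knowledge map

for param in model.parameters():
    param.requires_grad = False
for i in TARGET_LAYERS:
    for param in model.transformer.h[i].parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.1%} of parameters")
```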
Debiasing. If you find demographic biases in specific layers (the age discontinuity is a clear example), you can apply corrections surgically rather than hoping RLHF catches everything.
Model editing. Need to update a drug interaction that changed in clinical guidelines? Layer-specific editing is more tractable than retraining.
Safety auditing. Before deploying a medical LLM, map its knowledge layers. If critical clinical knowledge is concentrated in layers that also show activation collapse or instability, that’s a red flag.
The Connection to Abliteration
This work sits in the same intellectual space as abliteration research — the idea that understanding where features live geometrically inside a network gives you power over those features. Abliteration found that refusal lives in a single direction. This paper finds that medical knowledge lives in specific layers with specific geometric properties.
The implication is the same: linear algebra on the right layers can modify model behavior with surgical precision. The question is whether we use that capability for safety (targeted debiasing, knowledge updates) or risk (removing clinical guardrails).
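For a feel of what “linear algebra on the right layers” means, here is the core projection in miniature. The direction below is random for the sake of a runnable sketch; a real one would be derived from contrastive activations, as in abliteration.

```python
# Project a direction v out of a layer's output weights so the layer
# can no longer write along v. Shapes follow GPT-2's Conv1D convention,
# where the projection computes x @ W with W of shape (in, out).
import torch

d_model = 768
W_out = torch.randn(4 * d_model, d_model)  # MLP output projection

v = torch.randn(d_model)
v = v / v.norm()  # unit direction to ablate

# Right-multiplying by (I - v v^T) removes each row's component along v,
# so the layer's contribution becomes orthogonal to v.
W_ablated = W_out - (W_out @ v).outer(v)

print("component along v before:", (W_out @ v).norm().item())
print("component along v after: ", (W_ablated @ v).norm().item())
```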
The Models and the Code
The five models analyzed span the current open-source medical AI landscape:
- Llama 3.3-70B — Meta’s general-purpose flagship
- Gemma 3-27B — Google’s efficient architecture
- MedGemma-27B — Google’s medical fine-tune of Gemma
- Qwen-32B — Alibaba’s competitive mid-size model
- GPT-OSS-120B — OpenAI’s large-scale open-weight model
The paper and source code are available for reproducibility, and the methodology generalizes to any transformer-based LLM. If you’re deploying medical AI, mapping your model’s knowledge layers before deployment isn’t just good science — it’s due diligence.
Medical AI is moving from “does it get the right answer?” to “does it get the right answer for the right reasons, in the right part of the network?” This paper is a significant step toward that second question.