What is the surprising locality of LLM behaviors?

Capabilities, styles, and safety behaviors often live in specific subspaces of weight matrices. You can add, remove, or modify them with surgical precision. LLMs aren't inscrutable black boxes.

What are task vectors?

Capabilities as directions in weight space. From Ilharco et al.'s 'Editing Models with Task Arithmetic' — fine-tune a model on a task, compute the weight difference, and that vector represents the capability.

What does this mean for fine-tuning and AI safety?

Profound implications: model merging, targeted editing, interpretability, and safety interventions become more tractable when behaviors are localized rather than distributed across all weights.

The Surprising Locality of LLM Behaviors: Why a Few Weights Control Everything

By Prahlad Menon Published 2026-03-04 1 min read

We used to think large language models were inscrutable black boxes—billions of parameters working together in ways we couldn’t decompose or understand.

Turns out we were wrong.

A growing body of research shows that LLM behaviors are surprisingly localized. Capabilities, styles, and even safety behaviors often live in specific subspaces of the weight matrix. You can add, remove, or modify them with surgical precision.

This has profound implications for fine-tuning, model merging, interpretability—and AI safety.

Task Vectors: Capabilities as Directions

The foundational insight comes from Ilharco et al.’s 2023 paper “Editing Models with Task Arithmetic”:

task_vector = fine_tuned_weights - pretrained_weights

That’s it. Subtract the pretrained model from the fine-tuned model, and you get a “task vector”—a direction in parameter space that encodes the capability you trained.

The wild part? These vectors compose:

Operation	Result
`pretrained + task_vector_A`	Model with capability A
`pretrained + task_vector_A + task_vector_B`	Model with both capabilities
`pretrained - task_vector_A`	Model with capability A removed
`task_vector_A - task_vector_B`	The “difference” between capabilities

You can literally do arithmetic on model capabilities. Add French, subtract toxicity, combine coding and math. No retraining required.

LoRA: Fine-Tuning with 0.1% of Weights

If capabilities are localized, do we really need to update all parameters when fine-tuning?

LoRA (Low-Rank Adaptation) answered decisively: no.

By decomposing weight updates into low-rank matrices, LoRA achieves competitive fine-tuning performance while updating less than 1% of parameters. For a 7B model, that’s ~70M parameters instead of 7B.

This works because the “direction” that encodes a new capability doesn’t require changing every weight—just the ones that define that subspace.

Abliteration: Refusal is One Direction

Here’s where it gets uncomfortable for AI safety.

Abliteration is a technique that removes refusal behavior from language models. The process:

Collect prompts the model refuses (“How do I hack…”) and prompts it answers
Run both through the model, capture hidden states
Find the direction that separates refusal from compliance
Remove that direction from the model’s weights

The result? A model that answers everything, with its other capabilities intact.

The refusal behavior—all of it—lives in essentially one direction in the model’s representation space. Tools like OBLITERATUS now automate this with 13 different methods across 116+ models.

As one researcher put it:

“If alignment can be trivially removed by anyone with model weights and a GitHub tutorial, then alignment as currently implemented is not a security mechanism but a speed bump.”

The Geometry of Alignment

Recent work has gone further, showing you can fingerprint how a model was aligned just from its weight geometry:

DPO (Direct Preference Optimization) creates one kind of subspace
RLHF (Reinforcement Learning from Human Feedback) creates another
CAI (Constitutional AI) creates yet another

Each alignment method leaves a distinct signature. You can detect which was used without knowing anything about the training process—just by analyzing the model’s weights.

This means alignment isn’t hidden or distributed. It’s a specific, detectable, removable structure.

What This Means

For Fine-Tuning

Stop thinking about fine-tuning as “retraining the model.” Think about it as “adding a direction.”

Use LoRA or similar for efficiency
Consider task vector arithmetic for combining capabilities
Understand that you’re editing a subspace, not the whole model

For Model Merging

If capabilities are directions, you can merge models by combining their directions:

merged = base + 0.5 * (model_A - base) + 0.5 * (model_B - base)

This is why model merging works at all. Capabilities don’t destructively interfere because they often occupy different subspaces.

For Interpretability

The locality of behaviors is a gift for mechanistic interpretability. Instead of searching billions of parameters for “where does the model know French,” you can:

Create a task vector for French
Identify which layers/dimensions it affects most
Study those specific circuits

For AI Safety

This is the hard one.

If safety behaviors are as localized as other capabilities, they can be removed just as easily. Current alignment techniques create detectable, removable structures.

Some implications:

Open weights = removable safety (for motivated actors)
Safety must be distributed, not localized, to be robust
Detection might matter more than prevention

The “Embarrassingly Simple Defense” paper proposes dispersing safety signals across multiple dimensions to make them harder to isolate. But this is an arms race.

The Bigger Picture

Five years ago, we thought neural networks were mysterious soups of numbers. Now we’re finding structure:

What We Thought	What We Found
Capabilities are distributed	Many are localized to subspaces
Fine-tuning changes everything	Often changes <1% of weights
Alignment is deep	Often one direction, removable
Models are black boxes	Increasingly decomposable

This is good news for efficiency (cheaper fine-tuning), interpretability (findable circuits), and capability composition (model merging).

It’s more complicated news for safety. The same locality that makes models editable makes them un-editable in the safety direction.

But knowing the structure is better than not knowing. You can’t defend a system you don’t understand.

Building with LLMs? Check out soul.py for persistent agent memory, and soul-schema for auto-documenting your data warehouse.