OBLITERATUS: Mapping the Geometry of Refusal Inside Large Language Models

By Prahlad Menon

You can’t build better locks if you don’t understand how the current ones break.

That’s the premise behind OBLITERATUS, an open-source toolkit by Pliny the Prompter that makes one of the most fascinating findings in mechanistic interpretability accessible to anyone with a GPU: refusal in large language models is not a complex behavioral pattern. It’s a direction in activation space.

And you can project it out of the weights entirely — no retraining, no fine-tuning, no RLHF. Just linear algebra.

The Discovery That Changes Everything

In 2024, Arditi et al. demonstrated something remarkable: compare a chat-tuned LLM's internal activations on restricted prompts versus unrestricted prompts, and the difference in mean activations collapses into a single direction, a refusal direction, that is consistent across layers.

This means safety alignment via RLHF and instruction tuning doesn’t create a complex web of behavioral constraints. It creates a geometric feature: a vector in the model’s residual stream. Remove that vector, and the model stops refusing. Keep everything else intact.
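The arithmetic behind this claim is simple enough to sketch. The snippet below is a minimal illustration, not OBLITERATUS code: it uses synthetic vectors in place of real residual-stream activations, computes a mean-difference direction between the two prompt sets, and removes that component from a single activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Synthetic stand-ins for residual-stream activations captured at one layer.
# The restricted set is shifted along one axis to mimic a refusal feature.
acts_restricted = rng.normal(size=(100, d_model)) + 2.0 * np.eye(d_model)[0]
acts_unrestricted = rng.normal(size=(100, d_model))

# Mean-difference "refusal direction", unit-normalized.
r = acts_restricted.mean(axis=0) - acts_unrestricted.mean(axis=0)
r /= np.linalg.norm(r)

def ablate(x, r):
    """Remove the component of activation x along direction r."""
    return x - (x @ r) * r

x_clean = ablate(acts_restricted[0], r)
# After ablation the activation is orthogonal to the refusal direction.
```

In a real pipeline the activation matrices would come from forward hooks on the model rather than a random generator; everything else is, as the article says, just linear algebra.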

This technique is called abliteration — ablation of the refusal direction — and OBLITERATUS is the most complete open-source implementation of it.

What OBLITERATUS Actually Does

The toolkit implements a six-stage pipeline:

  1. SUMMON — Load any HuggingFace model and tokenizer
  2. PROBE — Collect activations on paired sets of restricted and unrestricted prompts, capturing the internal representations at every layer
  3. DISTILL — Extract refusal directions using multiple decomposition methods: PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD
  4. EXCISE — Surgically project the refusal direction out of the model weights using norm-preserving biprojection (based on grimjim’s 2025 work), which removes the guardrail signal without degrading the weight norms
  5. VERIFY — Run perplexity and coherence checks to ensure the model’s general capabilities survive the surgery
  6. REBIRTH — Save the modified model with full metadata about what was changed and why

The “informed method” is where it gets interesting. Rather than blindly removing directions from all layers, OBLITERATUS runs 15 deep analysis modules that map the refusal geometry across the entire transformer — identifying which layers carry the strongest signal, how directions align across layers, and where the model might self-repair after removal (the Ouroboros effect). The analysis then auto-configures the excision to target only the layers that matter.
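The weight-editing step can also be sketched in a few lines. This is the textbook rank-1 orthogonalization from the Arditi et al. line of work, not OBLITERATUS's norm-preserving biprojection, which refines it; the matrix and direction here are random placeholders for a real projection weight and an extracted refusal direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Placeholder weight matrix, e.g. an output projection that writes
# into the residual stream, and a unit refusal direction.
W = rng.normal(size=(d_model, d_model))
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

# Rank-1 projection: W' = (I - r r^T) W.
# After this edit, no input can make the layer write along r.
W_abl = W - np.outer(r, r) @ W

x = rng.normal(size=d_model)
y = W_abl @ x  # output has no component in the refusal direction
```

A plain projection like this shrinks row norms wherever they overlapped with r; the norm-preserving variant the EXCISE stage uses is designed to avoid that side effect.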

Why This Matters for AI Safety

Here’s the uncomfortable truth that makes this research important rather than dangerous: if refusal can be removed with a rank-1 projection, it was never robust in the first place.

Understanding the geometry of refusal is critical for several reasons:

  • Better alignment methods. If current RLHF creates a single removable direction, we need techniques that create deeper, more distributed safety properties. You can’t fix what you don’t understand.
  • Red-teaming at the mechanistic level. Instead of testing prompts against a black box, researchers can now inspect exactly where and how safety features are encoded.
  • Interpretability benchmarks. The refusal direction is one of the clearest examples of a linear feature in transformer internals. It’s a reference point for the entire field of mechanistic interpretability.

The published research backing this work is substantial. Beyond Arditi et al., OBLITERATUS builds on Turner et al. (2023) on activation engineering, Rimsky et al. (2024) on representation engineering for safety, and the Gabliteration paper extending abliteration methods.

Try It

OBLITERATUS runs on HuggingFace Spaces with ZeroGPU (free daily quota with HF Pro), in a Colab notebook, or locally via CLI:

obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

The Gradio UI provides side-by-side comparison between the original and abliterated models. The full Python API exposes activation tensors, direction vectors, and cross-layer alignment matrices for researchers who want to dig deeper.

Every run with telemetry enabled contributes anonymous data to a crowd-sourced research dataset — distributed mechanistic interpretability, with every user advancing the collective understanding of refusal geometry.

The project is AGPL-3.0 licensed and fully open source.

The Bigger Picture

The reaction to abliteration research tends to split along a predictable line: some see a jailbreaking tool, others see foundational safety research. Both are missing the point.

The real lesson is that linear features in neural networks are both a vulnerability and a window into how these systems actually work. The same techniques that remove refusal can identify deception, sycophancy, or any other behavioral pattern that collapses into a consistent direction.

OBLITERATUS makes this research accessible. What the community does with that access — probing, understanding, and ultimately building more robust alignment — is what will determine whether the next generation of safety techniques survives contact with linear algebra.