Part 4: Mechanistic Interpretability – Reverse-Engineering the AI Brain
This is Part 4 of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. We have covered the math (Part 1), the architecture (Part 2), and the threat landscape (Part 3). Now we learn to see inside the black box.
The Reverse Engineer’s Mindset
I spent years in cybersecurity doing reverse engineering — taking compiled binaries, stripping them down in IDA Pro or Ghidra, tracing execution paths, and building a mental model of what the software actually does versus what the documentation claims it does. That discipline — that refusal to trust the label on the box — is exactly what we need for AI security.
Mechanistic interpretability is the practice of reverse-engineering neural networks to understand the specific computational mechanisms they use to produce their outputs. It is not about getting a vague “explanation” of why a model made a decision. It is about identifying the exact neurons, the exact weight values, and the exact tensor operations that implement specific behaviors.
If traditional explainability is like reading the marketing brochure, mechanistic interpretability is like reading the disassembly.
The field has exploded in the past few years, driven primarily by researchers at Anthropic, DeepMind, and several academic labs who recognized that understanding model internals is not optional — it is a prerequisite for alignment, safety, and security.
Why Interpretability Is a Security Imperative
Let me connect this directly to the threat model we built in Part 3.
We identified three attack layers: input, weight, and output. For each one, our ability to detect and defend against attacks depends on our ability to understand what the model is doing internally:
- Detecting prompt injection requires understanding which tokens are receiving disproportionate attention and why.
- Detecting data poisoning backdoors requires identifying anomalous computational pathways that activate only under specific trigger conditions.
- Detecting memorized training data requires understanding which FFN neurons store which facts and when they activate.
Without interpretability, we are doing black-box security testing — poking at the model from the outside and hoping we find the problems. With interpretability, we can do white-box analysis — examining the model’s internals with the same rigor we apply to source code review.
As Neel Nanda, a leading interpretability researcher, has argued: “Mechanistic interpretability is to AI safety what decompilation is to software security. It is the foundation on which everything else is built” (Nanda, 2023).
Superposition: The First Barrier
Before we can interpret what is happening inside a neural network, we need to confront the single biggest obstacle: superposition.
In an ideal world, each neuron in a network would represent one concept. Neuron 4,782 represents “cats.” Neuron 11,203 represents “legal liability.” You could read the model like a dictionary. But that is not how it works.
Real neural networks encode far more concepts than they have neurons. They accomplish this through superposition — representing multiple concepts as overlapping patterns across groups of neurons. Elhage et al. (2022) from Anthropic formalized this in their paper “Toy Models of Superposition,” demonstrating mathematically that neural networks learn to represent more features than they have dimensions by exploiting the geometry of high-dimensional spaces.
Think of it this way: in a 100-dimensional space, you can have far more than 100 nearly-orthogonal directions. Each direction can represent a different concept, and as long as the directions are approximately orthogonal (their dot products are close to zero), the representations do not interfere with each other too much.
import torch
import torch.nn.functional as F
# In high dimensions, random vectors are nearly orthogonal
dim = 768 # typical embedding dimension
num_features = 5000 # many more features than dimensions
# Generate random feature directions
feature_directions = torch.randn(num_features, dim)
feature_directions = F.normalize(feature_directions, dim=-1)
# Compute pairwise cosine similarities
# Most will be near zero (nearly orthogonal)
cos_sim = feature_directions @ feature_directions.T
off_diagonal = cos_sim[~torch.eye(num_features, dtype=bool)]
print(f"Mean absolute cosine similarity: {off_diagonal.abs().mean():.4f}")
# Typically around 0.03-0.05 — nearly orthogonal despite 5000 > 768
The implication for security is profound. When we inspect a single neuron’s activation, we are not seeing a single concept — we are seeing the superposition of multiple concepts projected onto one dimension. Identifying which concept is active requires decomposing the superposition, and that is where the real interpretability tools come in.
Sparse Autoencoders: Decomposing Superposition
The breakthrough tool for dealing with superposition is the sparse autoencoder (SAE). Cunningham et al. (2023) and Bricken et al. (2023) independently showed that training sparse autoencoders on a model’s internal activations can decompose superposed representations into interpretable features.
The idea is conceptually simple:
- Collect activation vectors from a model’s hidden layers as it processes many different inputs.
- Train an autoencoder that reconstructs these activations, but with a sparsity constraint on the hidden layer.
- The sparsity constraint forces the autoencoder to learn a dictionary of monosemantic features — individual directions in activation space that correspond to single, interpretable concepts.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """
    Sparse autoencoder for decomposing neural network activations
    into interpretable features.
    """
    def __init__(self, input_dim, hidden_dim, sparsity_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x):
        # Encode: project to a higher-dimensional sparse space
        hidden = torch.relu(self.encoder(x))
        # Decode: reconstruct original activations
        reconstructed = self.decoder(hidden)
        # Loss = reconstruction error + sparsity penalty
        recon_loss = (x - reconstructed).pow(2).mean()
        sparsity_loss = hidden.abs().mean()
        total_loss = recon_loss + self.sparsity_coeff * sparsity_loss
        return reconstructed, hidden, total_loss

# hidden_dim >> input_dim to allow an overcomplete representation
# e.g., input_dim=768, hidden_dim=32768
# Each of the 32768 hidden units ideally represents one concept
Anthropic’s research team applied this at scale to Claude, identifying thousands of interpretable features including concepts like “Golden Gate Bridge,” “code written in Python,” “deceptive reasoning,” and “requests to bypass safety measures” (Templeton et al., 2024). Each feature corresponds to a specific direction in the model’s activation space — a specific tensor pattern that activates when the model is processing that concept.
For security engineers, this is transformative. Instead of probing a model with thousands of test prompts and hoping to trigger misbehavior, we can directly inspect whether “dangerous” features are activating. We can monitor the “deceptive reasoning” feature in real-time. We can build detectors that watch for anomalous feature activation patterns that might indicate a backdoor trigger.
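To make that concrete, here is a minimal sketch of such a monitor built on the SparseAutoencoder defined above. The watched feature indices and the alert threshold are placeholders you would have to determine for your own model and SAE, not values from the published work.

import torch

def monitor_features(sae, activations, watched_features, threshold=0.5):
    """
    Flag inputs whose SAE feature activations exceed a threshold on any
    security-relevant feature (e.g., a hypothetical "bypass safety" feature).
    activations: [batch, input_dim] activations from the layer the SAE was trained on.
    """
    with torch.no_grad():
        _, features, _ = sae(activations)        # [batch, hidden_dim] sparse feature activations
    watched = features[:, watched_features]      # keep only the features we care about
    alerts = (watched > threshold).any(dim=-1)   # True for any input that fires a watched feature
    return alerts, watched

# Hypothetical usage: feature index 1337 stands in for a flagged concept
# alerts, scores = monitor_features(sae, layer_activations, watched_features=[1337])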
Circuit Analysis: Tracing Computational Pathways
If sparse autoencoders tell us what features a model is using, circuit analysis tells us how those features are connected — the specific computational pathways that transform inputs into outputs.
Olah et al. (2020) pioneered this approach in their “Zoom In” work on vision models, identifying interpretable circuits like:
- Curve detectors: Early-layer neurons that detect curved edges, connected to…
- Circle detectors: Mid-layer neurons that combine curves into circle detections, connected to…
- Wheel detectors: Later-layer neurons that identify wheels, connected to…
- Car detectors: High-level neurons that recognize cars.
Each connection in this chain is a specific weight tensor value — a specific number that determines how strongly one neuron’s output influences another neuron’s input. The “car detection circuit” is a traceable path through the network’s tensor weights.
Wang et al. (2023) extended this to language models, reverse-engineering the Indirect Object Identification (IOI) circuit in GPT-2. They identified the specific attention heads and MLP neurons that implement the algorithm for resolving sentences like “When Mary and John went to the store, John gave a drink to ___”:
- Duplicate Token Heads: Identify that “John” appears twice
- S-Inhibition Heads: Suppress the repeated name
- Name Mover Heads: Copy the non-repeated name (“Mary”) to the output
Each of these is a specific set of attention heads with specific weight tensor values. The researchers could modify individual tensor values and observe precisely how the circuit’s behavior changed.
Activation Patching: The Scalpel of Interpretability
The primary technique for circuit analysis is activation patching (also called causal tracing). The method:
- Run the model on a clean input and record all intermediate activations.
- Run the model on a corrupted input (where key information is changed).
- Selectively replace (“patch”) specific activations from the clean run into the corrupted run.
- Observe which patches restore the correct behavior.
def activation_patching(model, clean_input, corrupted_input, layer, position):
    """
    Replace a specific activation in the corrupted run
    with the activation from the clean run.
    """
    # Get clean activations
    clean_cache = {}

    def hook_clean(module, input, output):
        clean_cache['activation'] = output.clone()

    handle = model.layers[layer].register_forward_hook(hook_clean)
    model(clean_input)
    handle.remove()

    # Run the corrupted input but patch in the clean activation at the target position
    def hook_patch(module, input, output):
        output[:, position, :] = clean_cache['activation'][:, position, :]
        return output

    handle = model.layers[layer].register_forward_hook(hook_patch)
    patched_output = model(corrupted_input)
    handle.remove()
    return patched_output
If patching a specific layer and position restores the correct output, that tells us this location is causally responsible for the computation. By systematically patching different locations, we build a map of which components are necessary for which behaviors — a causal circuit diagram of the model’s reasoning.
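As a sketch of how the activation_patching helper above could be used to build that map, the loop below patches every (layer, position) pair and scores each patch by how much probability it restores to the clean run's correct answer. It assumes the model returns logits of shape [batch, seq, vocab]; the scoring metric is one common choice, not the only one.

import torch

def patching_sweep(model, clean_input, corrupted_input, clean_answer_token,
                   num_layers, seq_len):
    """
    Patch each (layer, position) individually and record how much the patch
    restores the probability of the clean run's correct answer.
    """
    results = torch.zeros(num_layers, seq_len)
    for layer in range(num_layers):
        for position in range(seq_len):
            patched_logits = activation_patching(
                model, clean_input, corrupted_input, layer, position)
            probs = torch.softmax(patched_logits[0, -1], dim=-1)
            # Higher restored probability => this location is more causally important
            results[layer, position] = probs[clean_answer_token].item()
    return results  # [num_layers, seq_len] causal importance map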
Meng et al. (2022) used this technique to localize factual knowledge in GPT-J, finding that specific facts are stored in specific MLP layers at specific token positions. They could then edit the model's knowledge surgically, changing "The Eiffel Tower is in Paris" to "The Eiffel Tower is in Rome" by applying a targeted rank-one update to a single MLP weight matrix while leaving the billions of other parameters untouched.
Probing Classifiers: What Does the Model Know?
Another powerful interpretability technique is probing — training small classifier networks on a model’s internal representations to test whether specific information is encoded there.
The method is straightforward:
- Run many examples through the model and collect internal activations at a specific layer.
- Train a simple linear classifier to predict some property (e.g., “is this token part of a named entity?”) from these activations.
- If the classifier succeeds, that information is linearly represented in the model’s activations at that layer.
from sklearn.linear_model import LogisticRegression
import numpy as np

def probe_for_concept(model, dataset, layer_idx, concept_labels):
    """
    Train a linear probe to detect whether a concept is
    encoded at a specific layer.
    """
    activations = []
    for text in dataset:
        acts = model.get_activations(text, layer=layer_idx)  # assumed shape [1, seq, dim]
        # Average over token positions and drop the batch dim -> [dim]
        activations.append(acts.mean(dim=1).squeeze(0).detach().numpy())
    X = np.stack(activations)  # [num_examples, dim]
    y = np.array(concept_labels)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    accuracy = probe.score(X, y)
    print(f"Layer {layer_idx} probe accuracy: {accuracy:.3f}")
    return probe
Belinkov (2022) surveyed the probing literature comprehensively, showing that different types of information are encoded at different depths in a transformer. Syntactic information (part-of-speech, dependency relations) tends to be most accessible in earlier layers, while semantic information (sentiment, topic, intent) is more prominent in later layers.
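One simple way to see this depth profile on your own model is to sweep the probe from the previous snippet across every layer. The sketch below reuses the hypothetical model.get_activations interface and, like the function above, reports accuracy on the probing set itself; in practice you would use a held-out split.

def probe_all_layers(model, dataset, concept_labels, num_layers):
    """
    Train one linear probe per layer; probe_for_concept prints each layer's
    accuracy, so the output is a depth profile of where the concept is
    most linearly accessible.
    """
    probes = {}
    for layer_idx in range(num_layers):
        probes[layer_idx] = probe_for_concept(model, dataset, layer_idx, concept_labels)
    return probes

# Hypothetical usage: labels mark whether each text expresses harmful intent
# probes = probe_all_layers(model, texts, harmful_intent_labels, num_layers=24)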
For security, probing enables us to ask questions like:
- “Does this model encode information about harmful intent at layer 24?”
- “Is the concept of ‘deception’ linearly separable in the model’s activation space?”
- “Can we detect when the model is ‘reasoning about’ bypassing safety constraints?”
If the answer is yes, we can build runtime monitors that detect these activations and intervene before the model produces harmful output. This is defense at the tensor level — operating on the same mathematical substrate where the threats live.
Logit Lens and Tuned Lens: Reading the Model’s Draft Answers
A beautifully intuitive interpretability technique is the logit lens (nostalgebraist, 2020), later refined into the tuned lens by Belrose et al. (2023). The idea: at every layer of the transformer, project the intermediate activations through the model’s final unembedding matrix to see what token the model would predict if that layer were the last one.
This gives us a layer-by-layer view of how the model refines its predictions:
import torch

def logit_lens(model, input_text):
    """
    At each layer, peek at what the model would predict
    if processing stopped at that layer.
    """
    tokens = model.tokenize(input_text)
    hidden_states = model.get_all_hidden_states(tokens)
    for layer_idx, hidden in enumerate(hidden_states):
        # Project through the unembedding matrix
        logits = hidden @ model.unembed.weight.T
        top_token = logits[0, -1].argmax()
        top_word = model.decode(top_token)
        prob = torch.softmax(logits[0, -1], dim=0)[top_token].item()
        print(f"Layer {layer_idx:2d}: '{top_word}' (p={prob:.3f})")
Watching a model’s prediction evolve layer by layer is like watching a photograph develop in a darkroom. The early layers capture broad, noisy patterns. Middle layers refine and disambiguate. Later layers sharpen the final prediction. When a model gets something wrong — or when an adversarial input succeeds — the logit lens shows us exactly where in the processing pipeline things went awry.
Interpretability for Backdoor Detection
Let me bring this back to concrete security applications. One of the most promising uses of mechanistic interpretability is detecting backdoors planted through data poisoning.
Traditional backdoor detection methods are black-box: they test the model with various inputs and look for anomalous behavior. But this requires knowing (or guessing) what the trigger might be. If the trigger is subtle enough, black-box testing might miss it entirely.
Mechanistic interpretability offers a white-box alternative. Casper et al. (2024) proposed using interpretability techniques to identify backdoors by looking for:
- Anomalous feature activations: Features that activate only for specific, unusual input patterns (potential triggers).
- Hidden computational pathways: Circuits that are dormant for most inputs but activate strongly for specific patterns.
- Inconsistent representations: Cases where the model’s internal representation of an input diverges significantly from its representation of semantically similar inputs (a minimal check along these lines is sketched below).
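Here is a rough sketch of that third signal: compare a suspicious input's hidden representation against those of benign paraphrases and flag large divergences. The layer choice, cosine metric, and threshold are illustrative assumptions, and model.get_activations is the same hypothetical interface used in the probing example.

import torch
import torch.nn.functional as F

def representation_divergence(model, suspect_text, paraphrases, layer_idx, threshold=0.3):
    """
    Flag inputs whose internal representation sits unusually far from the
    representations of paraphrases with the same meaning, one possible
    symptom of a backdoor trigger hidden in the input.
    """
    def embed(text):
        acts = model.get_activations(text, layer=layer_idx)  # assumed shape [1, seq, dim]
        return acts.mean(dim=1).squeeze(0)                    # average over positions -> [dim]

    suspect_vec = embed(suspect_text)
    paraphrase_vecs = torch.stack([embed(p) for p in paraphrases])
    sims = F.cosine_similarity(paraphrase_vecs, suspect_vec.unsqueeze(0), dim=-1)
    divergence = 1.0 - sims.mean().item()
    return divergence > threshold, divergence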
Anthropic’s discovery of a “deceptive reasoning” feature in Claude (Templeton et al., 2024) demonstrated that potentially dangerous internal states can be identified and monitored. While this particular feature arose from training rather than adversarial poisoning, the technique generalizes: if we can find features representing harmful behaviors, we can build systems that detect and suppress them.
The Toolbox: Getting Started with Interpretability
If you want to start doing interpretability research yourself, here is the practical toolkit:
TransformerLens
Developed by Neel Nanda, TransformerLens is the go-to library for mechanistic interpretability research. It provides clean interfaces for hooking into model internals, caching activations, and performing patching experiments.
# pip install transformer-lens
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2-small")
# Run with full activation caching
logits, cache = model.run_with_cache("The security of AI systems")
# Access any internal activation
layer_5_attention = cache["blocks.5.attn.hook_pattern"]
layer_10_mlp = cache["blocks.10.hook_mlp_out"]
print(f"Attention pattern shape: {layer_5_attention.shape}")
print(f"MLP output shape: {layer_10_mlp.shape}")
SAELens
For working with sparse autoencoders, SAELens provides pre-trained SAEs for popular models and tools for training your own.
Circuitsvis
For visualizing attention patterns and other internal model states, Circuitsvis provides interactive HTML visualizations that can be embedded in Jupyter notebooks.
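Here is a minimal sketch of what that looks like in a notebook, assuming the TransformerLens model and cache from the snippet above are still in scope; attention_patterns expects string tokens and an attention tensor of shape [heads, seq, seq].

# pip install circuitsvis
import circuitsvis as cv

text = "The security of AI systems"
str_tokens = model.to_str_tokens(text)          # the prompt as a list of string tokens
attn = cache["blocks.5.attn.hook_pattern"][0]   # [n_heads, seq, seq] for batch item 0

# Renders an interactive attention-head viewer when run in a Jupyter notebook
cv.attention.attention_patterns(tokens=str_tokens, attention=attn)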
Baukit
Developed at MIT, Baukit (formerly known as nethook) provides utilities for intervention experiments — patching, ablating, and modifying model activations at runtime.
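A minimal example of the core pattern, assuming a Hugging Face GPT-2 whose block-10 MLP lives at the module path transformer.h.10.mlp (the path varies by architecture):

# pip install baukit
from baukit import Trace
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The security of AI systems", return_tensors="pt")

# Capture the output of block 10's MLP during an ordinary forward pass
with Trace(hf_model, "transformer.h.10.mlp") as ret:
    hf_model(**inputs)

print(ret.output.shape)  # the intercepted MLP activations, e.g. [1, seq_len, 768]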
The Limits of Interpretability (Honest Assessment)
I want to be honest about where we are. Mechanistic interpretability is a young field, and our tools are still primitive relative to the complexity of the systems we are trying to understand.
Scale challenges: The IOI circuit in GPT-2 Small involved about 26 attention heads; frontier models on the scale of GPT-4 are believed to have thousands. Fully reverse-engineering a frontier model’s circuits is currently intractable.
Superposition remains hard: Sparse autoencoders help, but we do not know if they capture all features, or if some computations are fundamentally distributed in ways that resist decomposition.
Faithfulness concerns: When we identify a “circuit” for some behavior, how confident can we be that the circuit fully explains the behavior? Might the model use different circuits for the same task on different inputs?
Huang et al. (2024) raised important questions about the reliability of interpretability methods, showing that some popular techniques can produce misleading results if applied carelessly.
But here is my perspective as a security engineer: imperfect interpretability is infinitely better than no interpretability. We do not need to fully reverse-engineer a model to detect a backdoor, any more than we need to fully understand an operating system to detect malware. We need tools that give us enough visibility to identify anomalies — and that is exactly what these techniques provide.
What’s Coming Next
In Part 5, we will move from theory to practice and build your first AI security lab. We will set up the tools, load real models, and start running the interpretability experiments we have been discussing. I will walk you through your first activation patching experiment, your first sparse autoencoder analysis, and your first attempt at identifying circuits in a live model.
We have spent four articles building the conceptual foundation. Now we build the workbench.
References
- Belinkov, Y. (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1), 207-219.
- Belrose, N., et al. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv preprint arXiv:2303.08112.
- Bricken, T., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic Research.
- Casper, S., et al. (2024). Black-Box Access is Insufficient for Rigorous AI Audits. arXiv preprint arXiv:2401.14446.
- Cunningham, H., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR.
- Elhage, N., et al. (2022). Toy Models of Superposition. Anthropic Research.
- Huang, J., et al. (2024). Rethinking Interpretability in the Era of Large Language Models. arXiv preprint arXiv:2402.01761.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
- Nanda, N. (2023). Mechanistic Interpretability Quickstart Guide. Personal Blog.
- nostalgebraist (2020). interpreting GPT: the logit lens. LessWrong.
- Olah, C., et al. (2020). Zoom In: An Introduction to Circuits. Distill.
- Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic Research.
- Wang, K., et al. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. ICLR.
Join the Mission
This is just the beginning. I will be sharing my code, data, and research findings as I go. If you are interested in the intersection of AI, Quantum, and Security, I’d love to connect.
- GitHub: github.com/bitghostsecurity
- Collaborate: hello@bitghostsecurity.com
Hardened Logic for an Intelligent Era.