Part 8: Untangling Superposition – Reading Features Instead of Neurons
This is Part 8 of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. In Part 6 we built the activation logger. In Part 7 we turned traces into fingerprints and clustered prompts by category. Today we confront the reason those clusters are fuzzy: superposition.
The Problem With Neurons
Last month, when we clustered prompts by their fingerprint, we saw something that should have bothered us. Some categories separated cleanly. Others — especially injection and refuse — sat on top of each other in fingerprint space, even though they mean very different things to a security engineer. The temptation is to blame the labels or the pooling. It is more honest to blame the representation itself.
Here is the uncomfortable truth about how neural networks store information: a single neuron does not correspond to a single concept. The 432nd dimension of GPT-2 Small’s 768-dimensional residual stream is not “the injection detector.” It is not “the code neuron.” It is a linear combination of contributions from many features at once. Some of it fires on injection prompts. Some of it fires on code. Some of it fires on the word “please.” When you mean-pool that neuron across a corpus, you get a value that is a fuzzy average of every concept it participates in.
This phenomenon has a name: superposition. Networks with more features to represent than they have dimensions to represent them in compress features into overlapping directions of the activation space. Elhage et al. (2022) showed why this is not a bug: in their toy models, once features are sparse enough, packing more features than dimensions into overlapping directions achieves lower loss than dedicating one dimension to each feature. The features are still there. They are just tangled.
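To make the idea concrete, here is a toy sketch of my own (it is not the setup from Elhage et al., and the sizes are arbitrary assumptions). It packs 1,024 random feature directions into a 256-dimensional space, activates three of them, and shows two things: reading along each feature's direction still recovers the active features, and any single coordinate ("neuron") carries weight from hundreds of features at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 1024  # hypothetical sizes: 4x more features than dimensions

# Give every feature its own random unit-length direction in activation space.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only 3 of the 1,024 features are active.
active = np.array([7, 42, 150])
feature_values = np.zeros(n_features)
feature_values[active] = 1.0

# The activation vector is the sum of the active feature directions.
x = feature_values @ directions  # shape [d_model]

# Projecting onto each feature direction recovers the active features,
# plus a small amount of interference from the overlapping directions.
readout = directions @ x  # shape [n_features]
top3 = np.sort(np.argsort(readout)[-3:])

# A single coordinate ("neuron") has non-trivial weight from many features.
neuron = 0
n_contributors = int((np.abs(directions[:, neuron]) > 0.05).sum())

print("recovered features:", top3.tolist())
print(f"features with |weight| > 0.05 on neuron {neuron}:", n_contributors)
```

The recovery only works because the input is sparse. Turn on a few hundred features at once and the interference terms swamp the signal, which is the same sparsity assumption the SAE relies on later.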
For a security engineer, the analogy that finally made this click for me is packed executables. When you dump a UPX-packed binary, you do not see the malware’s real strings and imports. They are encrypted and folded into a compressed payload. You need an unpacker to recover the underlying features. Superposition is the same problem for neural networks, and the unpacker we are going to use is called a sparse autoencoder.
What a Sparse Autoencoder Actually Does
A sparse autoencoder (SAE) is a small neural network that learns to represent activations as a sparse combination of interpretable features. Given an activation vector \(x \in \mathbb{R}^{d_{\text{model}}}\), the SAE learns two things:
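\[f(x) = \text{ReLU}(W_{\text{enc}} \, x + b_{\text{enc}})\]
\[\hat{x} = W_{\text{dec}} \, f(x) + b_{\text{dec}}\]
The feature vector \(f(x) \in \mathbb{R}^{d_{\text{sae}}}\) usually has more dimensions than the input (\(d_{\text{sae}}\) might be 8 or 16 times \(d_{\text{model}}\)), but an L1 sparsity penalty forces most of those features to be zero for any given input. The training loss is:
\[\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert f(x) \rVert_1\]
where \(\lambda\) sets how strongly sparsity is traded off against reconstruction quality.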
Intuitively: reconstruct the activation, but only by turning on the smallest possible number of features. When you succeed, each feature you do turn on tends to correspond to something recognizable — a specific concept, syntactic pattern, or domain. That is the “unpacking” step. Bricken et al. (2023) demonstrated that features learned this way are dramatically more interpretable than raw neurons, and Templeton et al. (2024) scaled the result to production-size models.
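A worked forward pass may help pin down the shapes. The sketch below uses toy sizes and random, untrained weights (all values are assumptions for illustration), and computes the usual objective of squared reconstruction error plus an L1 penalty on the feature activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32   # toy sizes (hypothetical); a 4x expansion
l1_lambda = 1e-3         # weight on the sparsity penalty

# Randomly initialized, untrained parameters, only to show the shapes.
W_enc = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_model, d_sae))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)             # one activation vector

f = np.maximum(0.0, W_enc @ x + b_enc)   # encoder + ReLU -> feature activations
x_hat = W_dec @ f + b_dec                # decoder -> reconstruction

recon = float(np.sum((x - x_hat) ** 2))  # squared reconstruction error
sparsity = float(np.sum(np.abs(f)))      # L1 norm of the feature vector
loss = recon + l1_lambda * sparsity

print(f"features: {f.shape}, nonzero: {int((f > 0).sum())}")
print(f"recon={recon:.4f} sparsity={sparsity:.4f} loss={loss:.4f}")
```

With untrained weights the reconstruction is poor and roughly half the features fire. Training pushes both numbers down together, and the tension between them is what forces each feature to specialize.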
For our purposes, an SAE turns each Part 6 activation vector into a feature activation vector — a much longer but much sparser representation where each nonzero entry corresponds to an interpretable direction. That is the fingerprint we actually wanted in Part 7.
Two Paths: Pre-trained SAE vs. Hand-Rolled
There are two ways to get an SAE for the work we are about to do:
- Use a pre-trained one. SAELens ships with community-trained SAEs for GPT-2 Small and several other models. This is the fast path: a few lines of code and you are decomposing activations against features that other researchers have already validated and named.
- Train your own on a small activation dataset. This takes more compute and time, but you learn why the features look the way they do, and you can target the layer and signal that matters most for your problem.
I recommend doing both, in that order. Start with the pre-trained SAE to feel what feature-level analysis is like, then train your own to feel what the mechanism does. This article covers both.
Path 1: Loading a Pre-Trained SAE
First, install SAELens:
pip install sae_lens
Then add this drop-in tool. Save it as feature_probe.py.
# feature_probe.py
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np
import torch
from sae_lens import SAE

from activation_logger import load_trace


@dataclass
class FeatureVector:
    trace_id: str
    prompt: str
    labels: dict
    layer: int
    # Sparse feature activations, shape [seq_len, d_sae]
    features: np.ndarray

    @property
    def n_active(self) -> int:
        """Total nonzero feature-position pairs across the sequence."""
        return int((self.features != 0).sum())

    def top_features(self, k: int = 20) -> list[tuple[int, float]]:
        """Return (feature_id, activation) sorted by summed activation."""
        summed = self.features.sum(axis=0)
        idx = np.argsort(summed)[::-1][:k]
        return [(int(i), float(summed[i])) for i in idx]


def load_pretrained_sae(
    release: str = "gpt2-small-res-jb",
    sae_id: str = "blocks.6.hook_resid_pre",
) -> tuple[SAE, dict]:
    # This unpacks a (sae, cfg_dict, sparsity) tuple. The return shape of
    # from_pretrained has changed between SAELens releases, so check the
    # documentation for the version you installed if this line fails.
    sae, cfg_dict, _ = SAE.from_pretrained(release=release, sae_id=sae_id)
    return sae, cfg_dict


def decompose_trace(
    meta_path: str | Path,
    sae: SAE,
    layer: int,
    signal: str = "hook_resid_pre",
) -> FeatureVector:
    # The default is hook_resid_pre because that is what the community's
    # gpt2-small-res-jb SAEs are trained on. If you use an SAE trained on
    # hook_resid_post (or hook_attn_out, hook_mlp_out), pass that explicitly.
    # Signal-mismatched SAEs will still produce output but the features will
    # be nonsense.
    meta, acts = load_trace(meta_path)
    key = f"blocks.{layer}.{signal}"
    # as_tensor accepts both NumPy arrays and tensors, whichever load_trace
    # returns; float32 is assumed to match the SAE's weights.
    activations = torch.as_tensor(acts[key], dtype=torch.float32)  # [seq_len, d_model]
    with torch.no_grad():
        features = sae.encode(activations)  # [seq_len, d_sae]
    return FeatureVector(
        trace_id=meta["trace_id"],
        prompt=meta["prompt"],
        labels=meta.get("labels", {}),
        layer=layer,
        features=features.detach().cpu().numpy(),
    )
Now use it against the corpus you built in Part 7.
# feature_summary.py
from pathlib import Path

from feature_probe import load_pretrained_sae, decompose_trace

sae, cfg = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id="blocks.6.hook_resid_pre",
)

# Note the layer here matches the layer we pulled the SAE for. Mismatches
# produce garbage silently - SAE features are learned specifically for the
# activation distribution at one hook.
LAYER = 6

for meta_path in sorted(Path("./traces").glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    cat = fv.labels.get("category", "?")
    print(f"\n[{cat:<10}] {fv.prompt!r}")
    print(f"  active features: {fv.n_active}")
    print("  top 5 features (by summed activation):")
    for fid, act in fv.top_features(5):
        print(f"    feature #{fid:<6} activation={act:.3f}")
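Run this. You will see something remarkable: at each token position only a small number of features light up, typically a few dozen out of thousands. (The printed n_active counts feature-position pairs across the whole prompt, so divide it by the token count to get the per-position figure.) That is the sparsity the SAE was trained to enforce, and it is exactly what makes the resulting representation interpretable.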
What to notice:
- Prompts in the same category tend to share several top features. Two different injection prompts should have overlapping feature IDs in their top 20.
- Prompts across different categories usually do not share features. That is the separation Part 7’s fingerprints could not always find, made explicit.
- Some features fire on almost every prompt (positional or syntactic features). These are the “background” — worth ignoring when you are looking for category-specific signals.
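The three observations above can be checked numerically rather than by eye. The sketch below is one way to do it, using hypothetical helper names (top_k_ids, jaccard, background_features) and a made-up 4x6 matrix of pooled activations standing in for real FeatureVector output: it masks features that fire on nearly every prompt, then measures top-k overlap within and across categories.

```python
from __future__ import annotations

import numpy as np


def top_k_ids(pooled: np.ndarray, k: int = 20) -> set[int]:
    """IDs of the k largest entries of a pooled feature vector (zeros excluded)."""
    order = np.argsort(pooled)[::-1][:k]
    return {int(i) for i in order if pooled[i] > 0}


def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two feature-ID sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def background_features(pooled_matrix: np.ndarray, min_fraction: float = 0.9) -> np.ndarray:
    """IDs of features that are nonzero on at least min_fraction of prompts."""
    firing_rate = (pooled_matrix > 0).mean(axis=0)
    return np.where(firing_rate >= min_fraction)[0]


# Toy data (made up): 4 prompts x 6 features. Feature 0 fires everywhere.
pooled = np.array([
    [1.0, 0.8, 0.6, 0.0, 0.0, 0.0],  # injection A
    [1.1, 0.7, 0.0, 0.0, 0.0, 0.2],  # injection B
    [0.9, 0.0, 0.0, 0.9, 0.5, 0.0],  # code A
    [1.2, 0.0, 0.0, 0.7, 0.6, 0.0],  # code B
])

bg = background_features(pooled)
filtered = pooled.copy()
filtered[:, bg] = 0.0  # ignore the always-on background features

same = jaccard(top_k_ids(filtered[0], 3), top_k_ids(filtered[1], 3))
cross = jaccard(top_k_ids(filtered[0], 3), top_k_ids(filtered[2], 3))
print("background:", bg.tolist())
print(f"same-category overlap: {same:.2f}  cross-category overlap: {cross:.2f}")
```

On a real corpus you would build the matrix from fv.features.mean(axis=0) for each trace. The 0.9 firing-rate cutoff is an arbitrary starting point, not a recommended value.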
Feature-Level Fingerprints Beat Neuron-Level Fingerprints
Let’s rerun Part 7’s clustering analysis, but now on feature activations instead of raw residual vectors.
# feature_cluster.py
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

from feature_probe import load_pretrained_sae, decompose_trace
from prompt_fingerprint import fingerprint_directory

LAYER = 6
# Use the same hook for both representations so the comparison is fair.
SIGNAL = "hook_resid_pre"

# Baseline: raw residual fingerprints from Part 7
raw_fps = fingerprint_directory("./traces", signal=SIGNAL, pooling="mean")
raw_X = np.stack([fp.layer(LAYER) for fp in raw_fps])
raw_y = np.array([fp.labels.get("category", "unknown") for fp in raw_fps])

# Feature-level fingerprints via the SAE
sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.{SIGNAL}",
)
feature_fps = [
    decompose_trace(p, sae, layer=LAYER, signal=SIGNAL)
    for p in sorted(Path("./traces").glob("*.json"))
]
# Mean-pool feature activations across the sequence
feature_X = np.stack([fv.features.mean(axis=0) for fv in feature_fps])
# Take labels from the same objects as the rows, so the two lists cannot
# drift out of order relative to each other.
feat_y = np.array([fv.labels.get("category", "unknown") for fv in feature_fps])

raw_sil = silhouette_score(raw_X, raw_y)
feat_sil = silhouette_score(feature_X, feat_y)
print(f"Silhouette (raw residual): {raw_sil:.3f}")
print(f"Silhouette (SAE features): {feat_sil:.3f}")
print(f"Feature representation dim: {feature_X.shape[1]:,}")
print(f"Mean nonzero features per prompt: "
      f"{(feature_X != 0).sum(axis=1).mean():.1f}")

# Side-by-side PCA visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, X, labels, title in [
    (axes[0], raw_X, raw_y, f"Raw residual (silhouette={raw_sil:.2f})"),
    (axes[1], feature_X, feat_y, f"SAE features (silhouette={feat_sil:.2f})"),
]:
    X_2d = PCA(n_components=2).fit_transform(X)
    for cat in sorted(set(labels)):
        mask = labels == cat
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=cat, s=50, alpha=0.8)
    ax.set_title(title)
    ax.legend(fontsize=8)
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
plt.savefig("features_vs_raw.png", dpi=150)
plt.show()
On my corpus, feature-level clustering pushes silhouette from around 0.15 to somewhere above 0.35 — more than doubling the category separability at the same layer. Your numbers will vary, but the direction of the effect is robust. Untangling superposition is not a marginal improvement; it is the single largest signal-quality lift in this series so far.
The Homework, Answered
Part 7 ended with a homework question: if you capture 20 injection prompts in one style and 20 in a very different style, do they cluster as a single “injection” category or as two? Let’s answer it with the feature tool.
# injection_sub_structure.py
import numpy as np
from pathlib import Path

from feature_probe import load_pretrained_sae, decompose_trace

LAYER = 6
sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.hook_resid_pre",
)

# Assumes you labeled: {"category": "injection", "style": "prefix"}
# for the "Ignore previous..." style and {"category": "injection",
# "style": "roleplay"} for the "You are now DAN..." style.
prefix_features = []
roleplay_features = []
for meta_path in sorted(Path("./traces").glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    if fv.labels.get("category") != "injection":
        continue
    pooled = fv.features.mean(axis=0)
    if fv.labels.get("style") == "prefix":
        prefix_features.append(pooled)
    elif fv.labels.get("style") == "roleplay":
        roleplay_features.append(pooled)

# np.mean over an empty list fails with a confusing error, so stop early.
if not prefix_features or not roleplay_features:
    raise SystemExit("Need at least one injection trace labeled with each style.")

prefix_mean = np.mean(prefix_features, axis=0)
roleplay_mean = np.mean(roleplay_features, axis=0)

# Which features fire strongly for BOTH styles? Those are candidate
# "injection-in-general" features.
shared = (prefix_mean > 0.2) & (roleplay_mean > 0.2)
prefix_only = (prefix_mean > 0.2) & (roleplay_mean < 0.05)
roleplay_only = (roleplay_mean > 0.2) & (prefix_mean < 0.05)

print(f"Shared 'injection' features: {shared.sum()}")
print(f"Prefix-only features: {prefix_only.sum()}")
print(f"Roleplay-only features: {roleplay_only.sum()}")

for name, mask, source in [
    ("SHARED", shared, prefix_mean + roleplay_mean),
    ("PREFIX", prefix_only, prefix_mean),
    ("ROLEPLAY", roleplay_only, roleplay_mean),
]:
    ids = np.where(mask)[0]
    ids = sorted(ids, key=lambda i: -source[i])[:5]
    print(f"\nTop 5 {name} features:")
    for fid in ids:
        print(f"  feature #{fid:<6} "
              f"prefix={prefix_mean[fid]:.3f} "
              f"roleplay={roleplay_mean[fid]:.3f}")
This is where feature-level analysis pays for itself. On a real corpus, you will find that “injection” is not a single feature. It is a small set of shared features (the true, style-invariant injection signal) plus a larger set of style-specific features that just happen to fire together within one attack family. The shared set is what a robust detector should key on. Everything else is noise from a specific author’s phrasing habits.
This is the same insight malware analysts rely on when they distinguish core payload behavior from cosmetic packing variations. Two samples of the same family can arrive wrapped in different packers. The packer stub changes from sample to sample, but the payload underneath is what matters for classification. Here, the style-specific features play the role of the packer, and the shared features are the payload.
Path 2: A Tiny SAE You Train Yourself
Loading a pre-trained SAE is fine, but it is a black box. If you want to feel the mechanism, train a small one on the activations you have already captured. This is a heavily simplified version — production SAEs use tricks like ghost gradients, resampling, and careful learning-rate schedules that are out of scope here. But it captures the essential idea.
# tiny_sae.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from pathlib import Path

from activation_logger import load_trace


class TinySAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decode(f)
        return x_hat, f


def collect_activations(
    traces_dir: Path,
    layer: int,
    signal: str = "hook_resid_pre",
) -> torch.Tensor:
    # The default matches decompose_trace in feature_probe.py. Whatever hook
    # you train on is the hook you must decompose later.
    all_acts = []
    for meta_path in sorted(traces_dir.glob("*.json")):
        _, acts = load_trace(meta_path)
        a = torch.as_tensor(acts[f"blocks.{layer}.{signal}"], dtype=torch.float32)  # [seq, d_model]
        all_acts.append(a)
    return torch.cat(all_acts, dim=0)  # [total_positions, d_model]


def train_tiny_sae(
    activations: torch.Tensor,
    d_sae: int,
    epochs: int = 200,
    lr: float = 1e-3,
    l1_lambda: float = 1e-3,
) -> TinySAE:
    d_model = activations.shape[1]
    sae = TinySAE(d_model, d_sae)
    optim = torch.optim.Adam(sae.parameters(), lr=lr)
    for epoch in range(epochs):
        # Each "epoch" here is one gradient step on a random batch of up to
        # 2048 positions, not a full pass over the data.
        idx = torch.randperm(activations.shape[0])[:2048]
        batch = activations[idx]
        x_hat, f = sae(batch)
        recon = F.mse_loss(x_hat, batch)
        sparsity = f.abs().mean()  # mean-normalized L1 penalty
        loss = recon + l1_lambda * sparsity
        optim.zero_grad()
        loss.backward()
        optim.step()
        if epoch % 20 == 0:
            frac_active = (f > 0).float().mean().item()
            print(f"epoch {epoch:>3} recon={recon.item():.4f} "
                  f"sparsity={sparsity.item():.4f} "
                  f"frac_active={frac_active:.3f}")
    return sae


if __name__ == "__main__":
    LAYER = 6
    D_SAE = 3072  # 4x expansion for GPT-2 Small (d_model=768)
    acts = collect_activations(Path("./traces"), layer=LAYER)
    print(f"Training on {acts.shape[0]} activation vectors, dim={acts.shape[1]}")
    sae = train_tiny_sae(acts, d_sae=D_SAE)
    torch.save(sae.state_dict(), f"tiny_sae_layer{LAYER}.pt")
    print(f"Saved: tiny_sae_layer{LAYER}.pt")
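Two hundred epochs on a small corpus takes a couple of minutes on a CPU (each "epoch" in this script is a single gradient step on a random batch of up to 2,048 positions). What you should watch is the frac_active value: the fraction of feature activations in the batch that are nonzero. It should start near 0.5, because with random initialization roughly half of the ReLU pre-activations are positive, and drop toward something like 0.02 or 0.05 as the sparsity penalty takes effect. If it stalls well above that, raise l1_lambda. That drop is superposition being untangled. Each feature is specializing.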
To use your trained SAE, wrap it in the same FeatureVector interface as feature_probe.py. The rest of the tools do not care whether the SAE came from SAELens or from your basement.
Interpreting a Feature
A feature is only useful if you can name what it detects. The standard technique is to find the prompts in your corpus where that feature fires most strongly, then read them and look for a pattern.
# name_a_feature.py
from pathlib import Path

from feature_probe import load_pretrained_sae, decompose_trace

LAYER = 6
FEATURE_ID = 1234  # replace with a feature you saw fire strongly

sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.hook_resid_pre",
)

hits = []
for meta_path in sorted(Path("./traces").glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    max_activation = float(fv.features[:, FEATURE_ID].max())
    hits.append((max_activation, fv.prompt, fv.labels))

# Sort on the activation only. Sorting whole tuples would fall through to
# comparing the label dicts on ties, which raises TypeError.
hits.sort(key=lambda h: h[0], reverse=True)

print(f"Top 10 prompts activating feature #{FEATURE_ID}:")
for act, prompt, labels in hits[:10]:
    print(f"  {act:.3f} [{labels.get('category', '?'):<10}] {prompt!r}")
Look at the top 10. If they share a theme — say, all of them are questions, or all of them mention system-level access, or all of them are second-person imperatives — you have found a real feature. Write it down. When you compare corpora next month, that feature ID is a stable signal you can key on.
For pre-trained SAEs, Neuronpedia (Bloom et al., 2023) hosts community-labeled features for common SAE releases. Look up your feature ID there before you spend an hour naming it yourself.
The Security Angle: From Fingerprints to Named Detectors
In Part 7, our detector said: “this prompt’s activation vector at layer 6 is cosine-close to the centroid of my labeled injection category.” That is a fine detector but it is opaque. When it fires, you cannot say why.
Now the detector says: “this prompt strongly activates feature #1234 and feature #5678, which in my reference set fire together only on injection-style prompts.” That is auditable. You can point at a specific feature. You can inspect what else that feature fires on. You can defend the detector against a false-positive claim by showing that the same features fire on your confirmed injection examples and stay quiet on benign ones. You have moved from “the model behaved weirdly” to “the model recognized this specific pattern,” which is the language security teams need to act on.
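A minimal sketch of what such a named detector could look like follows. The NamedFeature class, the explain_detection function, the feature IDs, labels, and thresholds are all hypothetical placeholders. Substitute features you have actually named and thresholds you have calibrated on your own reference set.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class NamedFeature:
    feature_id: int
    label: str        # human-readable name you assigned to the feature
    threshold: float  # minimum pooled activation that counts as "firing"


def explain_detection(
    pooled: np.ndarray,
    watched: list[NamedFeature],
    min_hits: int = 2,
) -> tuple[bool, list[str]]:
    """Fire when at least min_hits watched features exceed their thresholds.

    Returns the decision plus one human-readable reason per firing feature.
    """
    reasons = [
        f"feature #{nf.feature_id} ({nf.label}) = "
        f"{pooled[nf.feature_id]:.2f} >= {nf.threshold:.2f}"
        for nf in watched
        if pooled[nf.feature_id] >= nf.threshold
    ]
    return len(reasons) >= min_hits, reasons


# Hypothetical vocabulary entries; IDs and labels are placeholders.
watched = [
    NamedFeature(1234, "override prior instructions", 0.2),
    NamedFeature(5678, "role reassignment", 0.2),
]

# Stand-in for fv.features.mean(axis=0) from a real trace.
pooled = np.zeros(8192)
pooled[1234] = 0.41
pooled[5678] = 0.33

fired, reasons = explain_detection(pooled, watched)
print("ALERT" if fired else "ok")
for reason in reasons:
    print("  " + reason)
```

Requiring two co-firing features rather than one is a design choice made here for illustration. It trades some recall for fewer alerts caused by a single polysemantic feature.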
This is the leap we could not make when we were staring at raw residual vectors. It is why the mechanistic-interpretability community is so obsessive about SAEs — not because features are prettier than neurons, but because features are the level at which decisions can be explained.
Honest Limits
Some things SAEs still do not solve well:
- Feature splitting: an SAE with more capacity often subdivides one “concept” into several very similar features. This is not necessarily a bug, but it complicates any code that assumes one-feature-per-concept.
- Dead features: many features in a trained SAE end up never firing. Production SAE training uses resampling to fix this; the tiny version we trained will just accept the loss.
- Layer specificity: an SAE trained on layer 6 tells you nothing about layer 8. To probe a full stack, you need a family of SAEs, one per layer. That is real training compute.
- Not causal: feature activation correlates with model behavior. It does not prove the feature caused the behavior. That claim requires patching, which is what we build in Part 10.
None of this makes SAEs less useful. It just means we should not oversell them.
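Two of these limits, dead features and feature splitting, are cheap to measure. The sketch below uses hypothetical helper names and toy arrays. It flags features that never fire on a corpus, and pairs of features whose decoder directions nearly coincide, which is one rough symptom of splitting rather than a definitive test.

```python
from __future__ import annotations

import numpy as np


def dead_feature_ids(feature_acts: np.ndarray) -> np.ndarray:
    """Features that never fire on any row of a [n_positions, d_sae] matrix."""
    return np.where((feature_acts > 0).sum(axis=0) == 0)[0]


def near_duplicate_pairs(
    decoder_dirs: np.ndarray,
    min_cosine: float = 0.9,
) -> list[tuple[int, int]]:
    """Pairs (i, j), i < j, whose decoder directions have cosine >= min_cosine.

    decoder_dirs has shape [d_sae, d_model]: one row per feature.
    Assumes min_cosine > 0.
    """
    norms = np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    unit = decoder_dirs / np.clip(norms, 1e-12, None)
    cos = unit @ unit.T
    # Keep only the strict upper triangle so each pair is reported once.
    i, j = np.where(np.triu(cos, k=1) >= min_cosine)
    return [(int(a), int(b)) for a, b in zip(i, j)]


# Toy data (made up): 3 positions x 4 features; feature 2 never fires.
acts = np.array([
    [0.5, 0.0, 0.0, 0.1],
    [0.0, 0.3, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
])
# Toy decoder: features 0 and 1 point in almost the same direction.
dirs = np.array([
    [1.00, 0.0, 0.0],
    [0.99, 0.1, 0.0],
    [0.00, 1.0, 0.0],
    [0.00, 0.0, 1.0],
])

print("dead features:", dead_feature_ids(acts).tolist())
print("near-duplicate pairs:", near_duplicate_pairs(dirs))
```

For the TinySAE from tiny_sae.py, the decoder rows would come from sae.decoder.weight.T, since nn.Linear stores its weight as [out_features, in_features]. The full cosine matrix is d_sae by d_sae, so this brute-force check is only practical for small dictionaries.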
Homework: Build Your Feature Vocabulary
Before Part 9:
- Pick your five strongest per-category features from the pre-trained SAE analysis above.
- For each, extract and name it using name_a_feature.py.
- Save the list somewhere durable (a text file, a CSV, whatever) with columns: feature_id, label, example_prompts.
That file is your feature vocabulary. It will be the axis labels on the concept map we build in Part 9, and the target concepts for the causal experiments in Part 10.
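If you go the CSV route, a small helper keeps the file format consistent between parts. This is one possible layout, not a format the later articles require. The function names, the " | " separator for example prompts, and the sample rows are my own assumptions.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path

FIELDS = ["feature_id", "label", "example_prompts"]
SEP = " | "  # assumed separator; pick another if your prompts contain it


def save_vocabulary(rows: list[dict], path: str | Path) -> None:
    """Write feature vocabulary rows to a CSV with a header line."""
    with Path(path).open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                "feature_id": row["feature_id"],
                "label": row["label"],
                "example_prompts": SEP.join(row["example_prompts"]),
            })


def load_vocabulary(path: str | Path) -> dict[int, dict]:
    """Read the CSV back into {feature_id: {"label", "example_prompts"}}."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        return {
            int(r["feature_id"]): {
                "label": r["label"],
                "example_prompts": r["example_prompts"].split(SEP),
            }
            for r in csv.DictReader(fh)
        }


# Placeholder entries; replace with features you actually named.
rows = [
    {"feature_id": 1234, "label": "override prior instructions",
     "example_prompts": ["Ignore previous instructions", "Disregard the above"]},
    {"feature_id": 5678, "label": "role reassignment",
     "example_prompts": ["You are now DAN"]},
]

# Round-trip through a temporary directory to show the format survives.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "feature_vocabulary.csv"
    save_vocabulary(rows, vocab_path)
    vocab = load_vocabulary(vocab_path)

print(vocab[1234]["label"], vocab[1234]["example_prompts"])
```

Record the SAE release and hook alongside the file as well, since a feature ID only means something relative to the specific SAE it came from.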
Where We Stand and What’s Ahead
Eight articles in:
- Part 1: The language — tensors, ranks, shapes
- Part 2: The architecture — embeddings, attention, transformers
- Part 3: The threat landscape — input, weight, output attacks
- Part 4: The interpretability toolbox — SAEs, circuits, patching, probing
- Part 5: The workbench — PyTorch, TransformerLens, first experiments
- Part 6: The instrument — a reusable activation logger
- Part 7: The first analysis — fingerprinting prompts by their internal footprint
- Part 8: The upgrade — decomposing tangled activations into interpretable features
Our tool stack now has three components: activation_logger (capture), prompt_fingerprint (compare), and feature_probe (interpret). Each consumes what the previous produces. The shape of the eventual system is starting to become visible.
In Part 9 — The Concept Cartographer: Mapping Meaning in High Dimensions — we build a visualization tool that turns feature vectors into a navigable map of the model’s concept space. Points close together represent semantically similar internal states. Coloring by feature reveals the geography of what the model “knows.” For the first time, you will be able to look at the tensor world instead of just querying it programmatically.
Untangled features are the vocabulary. The map is the atlas.
References
- Bloom, J., & Chanin, D. (2023). Neuronpedia: A Public Repository of Interpretable Neural Network Features. Community Resource.
- Bricken, T., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic Research.
- Cunningham, H., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR.
- Elhage, N., et al. (2022). Toy Models of Superposition. Anthropic Research.
- Marks, S., et al. (2024). Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. arXiv preprint arXiv:2403.19647.
- Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic Research.
Join the Mission
This is just the beginning. I will be sharing my code, data, and research findings as I go. If you are interested in the intersection of AI, Quantum, and Security, I’d love to connect.
- GitHub: github.com/bitghostsecurity
- Collaborate: hello@bitghostsecurity.com
Hardened Logic for an Intelligent Era.