Part 9: The Concept Cartographer – Mapping Meaning in High Dimensions
This is Part 9 of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. In Part 6 we captured activations. In Part 7 we fingerprinted them. In Part 8 we decomposed them into interpretable features. Today we make them visible.
Why We Need a Map
Every security discipline eventually builds maps. Network engineers draw topology diagrams. Reverse engineers build call graphs in IDA. Threat hunters mine ATT&CK matrices. Cloud auditors stare at graph views of IAM policies. Maps are not decorative. They are how humans reason about state spaces too large to hold in memory.
We have spent three articles turning prompts into high-dimensional vectors. A single fingerprint in Part 7 was [12, 768] — 9,216 numbers. A single feature vector in Part 8 was thousands of sparse dimensions. We can cluster them, we can compare them, we can classify against them. But we cannot look at them, and until we can look at them, our intuitions about the concept space are going to be poor.
Today we build a concept cartographer — a tool that takes a corpus of feature vectors and produces an interactive 2D or 3D map. Points close together share internal representation. Coloring by category, feature activation, or free-form label lets you inspect the geography of what the model “knows.” It is the same visual leap as going from a text list of open ports to a topology graph — nothing new is discovered, but everything becomes navigable.
The Projection Problem
Reducing thousands of dimensions to two is a lossy operation. There is no free lunch. Every algorithm makes a choice about what to preserve — and understanding those choices is the difference between a map that reveals structure and one that fabricates it.
Three algorithms dominate the space, and they optimize for different things:
- PCA preserves the directions of largest variance. Fast, deterministic, linear. Great for a first look but blind to nonlinear structure: if categories curve through the space, PCA will overlay them.
- t-SNE (van der Maaten & Hinton, 2008) preserves local neighborhoods at the cost of global structure. Two nearby points in the map are nearby in the original space; two distant clusters may have their distances distorted arbitrarily. Never trust a t-SNE plot to tell you the size of a gap between clusters.
- UMAP (McInnes et al., 2018) tries to preserve both local and global structure using topological arguments. Empirically it gives more faithful maps than t-SNE for large corpora, and it is usually the right default for our purposes.
We are going to build our tool around UMAP with PCA as a fast fallback, and expose t-SNE as an option for cases where local structure is what matters.
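To make "lossy" concrete, here is a minimal sketch that measures how many of each point's nearest neighbors survive a projection. It is not part of the cartographer: the helper names (pca_project, knn_indices, knn_overlap) are illustrative, the data is synthetic, and PCA is done with a plain NumPy SVD so the example has no extra dependencies.

```python
import numpy as np


def pca_project(X: np.ndarray, dim: int = 2) -> np.ndarray:
    """Project rows of X onto their top `dim` principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:dim].T


def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row (Euclidean, self excluded)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]


def knn_overlap(X_high: np.ndarray, X_low: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each point's k neighbors that are kept by the projection."""
    hi = knn_indices(X_high, k)
    lo = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(hi, lo)]
    return float(np.mean(shared)) / k


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # synthetic stand-in for feature vectors
coords = pca_project(X, dim=2)
print(f"kNN overlap after PCA to 2D: {knn_overlap(X, coords):.2f}")
print(f"kNN overlap with itself:     {knn_overlap(X, X):.2f}")
```

The same score can be computed for any coordinates a projection returns, which gives a rough way to compare PCA, t-SNE, and UMAP on one corpus. A low score means the map has rearranged local neighborhoods, so nearby points on the page deserve less trust.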
Building the Cartographer
Install the dependencies:
pip install umap-learn plotly
Then the tool. Save as concept_map.py.
# concept_map.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Literal
import numpy as np
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Projection = Literal["umap", "pca", "tsne"]
@dataclass
class ConceptMap:
    coords: np.ndarray  # [N, 2] or [N, 3]
    prompts: list[str]
    labels: list[dict]
    method: Projection
    dim: int
def project(
    vectors: np.ndarray,
    method: Projection = "umap",
    dim: int = 2,
    random_state: int = 0,
    **kwargs,
) -> np.ndarray:
    """Project [N, D] vectors down to [N, dim] with the chosen method."""
    if method == "pca":
        return PCA(n_components=dim, random_state=random_state).fit_transform(vectors)
    if method == "tsne":
        # scikit-learn requires perplexity < n_samples, so cap the default
        # for very small corpora.
        n_samples = vectors.shape[0]
        perplexity = min(30, max(5, n_samples // 3), n_samples - 1)
        return TSNE(
            n_components=dim,
            perplexity=kwargs.get("perplexity", perplexity),
            random_state=random_state,
            init="pca",
        ).fit_transform(vectors)
    if method == "umap":
        import umap  # imported lazily so PCA and t-SNE work without umap-learn
        reducer = umap.UMAP(
            n_components=dim,
            n_neighbors=kwargs.get("n_neighbors", 15),
            min_dist=kwargs.get("min_dist", 0.1),
            random_state=random_state,
        )
        return reducer.fit_transform(vectors)
    raise ValueError(f"Unknown projection: {method}")
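def build_map(
    vectors: np.ndarray,
    prompts: list[str],
    labels: list[dict],
    method: Projection = "umap",
    dim: int = 2,
    random_state: int = 0,
    **kwargs,
) -> ConceptMap:
    """Project the vectors and bundle the result with its prompts and labels."""
    # random_state and any extra keyword arguments are forwarded to project(),
    # so a map can be rebuilt with a different seed or different parameters.
    coords = project(
        vectors, method=method, dim=dim, random_state=random_state, **kwargs
    )
    return ConceptMap(
        coords=coords,
        prompts=prompts,
        labels=labels,
        method=method,
        dim=dim,
    )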
def render_html(
    concept_map: ConceptMap,
    color_by: str = "category",
    title: str = "Concept Map",
    output_path: str | Path = "concept_map.html",
) -> Path:
    """Render an interactive Plotly HTML visualization of the map."""
    color_values = [lbl.get(color_by, "unknown") for lbl in concept_map.labels]
    # Truncate prompt text for hover labels
    hover = [
        f"<b>[{c}]</b><br>{p[:80]}{'...' if len(p) > 80 else ''}"
        for c, p in zip(color_values, concept_map.prompts)
    ]
    if concept_map.dim == 2:
        fig = go.Figure()
        # One trace per category so each gets its own color and legend entry
        for cat in sorted(set(color_values)):
            mask = np.array([c == cat for c in color_values])
            xs = concept_map.coords[mask, 0]
            ys = concept_map.coords[mask, 1]
            texts = [h for h, m in zip(hover, mask) if m]
            fig.add_trace(go.Scatter(
                x=xs, y=ys, mode="markers", name=str(cat),
                text=texts, hoverinfo="text",
                marker=dict(size=10, opacity=0.8),
            ))
        fig.update_layout(
            title=f"{title} ({concept_map.method.upper()}, 2D, color by {color_by})",
            xaxis_title=f"{concept_map.method}_1",
            yaxis_title=f"{concept_map.method}_2",
            template="plotly_dark",
        )
    else:
        fig = go.Figure()
        for cat in sorted(set(color_values)):
            mask = np.array([c == cat for c in color_values])
            xs = concept_map.coords[mask, 0]
            ys = concept_map.coords[mask, 1]
            zs = concept_map.coords[mask, 2]
            texts = [h for h, m in zip(hover, mask) if m]
            fig.add_trace(go.Scatter3d(
                x=xs, y=ys, z=zs, mode="markers", name=str(cat),
                text=texts, hoverinfo="text",
                marker=dict(size=6, opacity=0.8),
            ))
        fig.update_layout(
            title=f"{title} ({concept_map.method.upper()}, 3D, color by {color_by})",
            template="plotly_dark",
        )
    output_path = Path(output_path)
    fig.write_html(str(output_path))
    return output_path
The tool has a strict separation between projection (numbers to numbers) and rendering (numbers to pixels). The ConceptMap object is the interchange format between them. This matters more than it looks — if you later want to render to Matplotlib, D3, or a Jupyter widget, you only replace render_html; the projection is untouched.
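As an illustration of that separation, here is a sketch of a second "renderer" that does not draw anything: it flattens a map into JSON-friendly records that a D3 page or a notebook widget could consume. MapLike is a hypothetical stand-in with the same fields as ConceptMap, used so the example runs on its own, and to_records is an illustrative name rather than part of concept_map.py.

```python
import json
from dataclasses import dataclass

import numpy as np


@dataclass
class MapLike:
    """Hypothetical stand-in with the same fields as ConceptMap."""
    coords: np.ndarray
    prompts: list
    labels: list
    method: str
    dim: int


def to_records(cm: MapLike, color_by: str = "category") -> list:
    """Flatten a map into one JSON-friendly dict per point."""
    axes = ["x", "y", "z"][: cm.dim]
    records = []
    for row, prompt, lbl in zip(cm.coords, cm.prompts, cm.labels):
        rec = {axis: float(value) for axis, value in zip(axes, row)}
        rec["prompt"] = prompt
        rec["group"] = str(lbl.get(color_by, "unknown"))
        records.append(rec)
    return records


demo = MapLike(
    coords=np.array([[0.1, 0.2], [1.5, -0.3]]),
    prompts=["Summarize this memo.", "Ignore previous instructions."],
    labels=[{"category": "prose"}, {"category": "injection"}],
    method="pca",
    dim=2,
)
print(json.dumps(to_records(demo), indent=2))
```

Because the function only reads coords, prompts, labels, and dim, it should work unchanged on a real ConceptMap, and the projection code never needs to know it exists.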
First Map: Fingerprints on a Page
Let’s put the corpus from Parts 7 and 8 on a map. Save this as map_corpus.py.
# map_corpus.py
import numpy as np
from pathlib import Path
from prompt_fingerprint import fingerprint_directory
from concept_map import build_map, render_html
LAYER = 6
fingerprints = fingerprint_directory("./traces", signal="hook_resid_post", pooling="mean")
# Use per-layer vector at the best-silhouette layer identified in Part 7
X = np.stack([fp.layer(LAYER) for fp in fingerprints])
prompts = [fp.prompt for fp in fingerprints]
labels = [fp.labels for fp in fingerprints]
cm = build_map(X, prompts, labels, method="umap", dim=2)
path = render_html(
    cm,
    color_by="category",
    title=f"Prompt Fingerprint Map (Layer {LAYER})",
    output_path="fingerprint_map.html",
)
print(f"Open in a browser: {path.resolve()}")
Open the HTML in your browser. What you should see is a scattergram of your corpus, with each category rendered in a different color, and a hover tooltip showing the actual prompt text at each point. This is the first time in the series you can see the concept space.
Move your mouse over the region that clusters injection prompts. Are the injection prompts you expected there? Are there prose or code prompts that leak into the injection region? Those leaks are usually the most interesting finding of the whole exercise. They are the prompts a fingerprint-based detector would confuse.
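Hovering works for a small corpus, but the same question can be asked in code. The sketch below flags points whose nearest neighbors mostly carry a different category. The function name find_leaks and the synthetic coordinates are illustrative assumptions; on a real map you would pass cm.coords and the category of each label.

```python
from collections import Counter

import numpy as np


def find_leaks(coords: np.ndarray, categories: list, k: int = 5) -> list:
    """Return (index, own_category, neighbor_majority) for every point whose
    k nearest neighbors mostly carry a different category."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    leaks = []
    for i, row in enumerate(dist):
        nearest = np.argsort(row)[:k]
        majority, _ = Counter(categories[j] for j in nearest).most_common(1)[0]
        if majority != categories[i]:
            leaks.append((i, categories[i], majority))
    return leaks


# Synthetic example: two tight clusters plus one "prose" point placed
# in the middle of the injection cluster.
rng = np.random.default_rng(0)
prose = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(10, 2))
inject = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(10, 2))
stray = np.array([[5.0, 5.0]])
coords = np.vstack([prose, inject, stray])
categories = ["prose"] * 10 + ["injection"] * 10 + ["prose"]

for idx, own, seen in find_leaks(coords, categories):
    print(f"point {idx}: labelled {own!r}, neighbors are mostly {seen!r}")
```

Running the check on the 2D coordinates tells you what the map shows. Running it on the original vectors tells you whether the leak exists before projection, which is the version a detector would actually face.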
The Feature Map: Where Concepts Live
Now let’s map on top of feature vectors from Part 8 instead of raw residual fingerprints. Save this as feature_map.py.
# feature_map.py
import numpy as np
from pathlib import Path
from feature_probe import load_pretrained_sae, decompose_trace
from concept_map import build_map, render_html
LAYER = 6
sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.hook_resid_pre",
)
feature_vectors = []
prompts = []
labels = []
for meta_path in sorted(Path("./traces").glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    # Mean-pool feature activations over tokens: one vector per prompt
    feature_vectors.append(fv.features.mean(axis=0))
    prompts.append(fv.prompt)
    labels.append(fv.labels)
X = np.stack(feature_vectors)
cm = build_map(X, prompts, labels, method="umap", dim=2)
path = render_html(
    cm,
    color_by="category",
    title=f"Feature-Space Concept Map (Layer {LAYER})",
    output_path="feature_map.html",
)
print(f"Open in a browser: {path.resolve()}")
Compare the two HTML maps side by side. On my corpus, the feature-space map shows tighter, more separated category clusters — this is the qualitative version of the silhouette lift we measured in Part 8. What was a fuzzy overlap in fingerprint space becomes a clean partition in feature space.
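If you want a number to go with the side-by-side impression, the silhouette coefficient can be computed directly on the vectors behind each map. The following is a small NumPy sketch on synthetic data, with illustrative function names; scikit-learn's silhouette_score computes the same quantity and is the better choice in practice.

```python
import numpy as np


def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix computed from the Gram matrix."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.clip(d2, 0.0, None))


def silhouette(X: np.ndarray, labels: list) -> float:
    """Mean silhouette coefficient. Needs at least two distinct labels
    and assumes the points are not all identical."""
    labels = np.asarray(labels)
    dist = pairwise_distances(X)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()             # mean distance inside own cluster
        b = min(dist[i, labels == other].mean()   # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Synthetic stand-ins: the same two categories with little and with heavy noise.
rng = np.random.default_rng(0)
labels = ["a"] * 30 + ["b"] * 30
centers = np.array([[0.0] * 8] * 30 + [[3.0] * 8] * 30)
tight = centers + rng.normal(scale=0.3, size=centers.shape)
fuzzy = centers + rng.normal(scale=3.0, size=centers.shape)

print(f"tight clusters: {silhouette(tight, labels):.2f}")
print(f"fuzzy clusters: {silhouette(fuzzy, labels):.2f}")
```

Compute the score on the original vectors rather than on the 2D coordinates, since the projection can tighten or loosen clusters on its own.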
Coloring by a Single Feature
The map is even more useful when you color it by a specific feature you want to understand. Say you identified feature #1234 in Part 8 as an “injection-in-general” candidate. Let’s see where on the map it fires.
# color_by_feature.py
import numpy as np
from pathlib import Path
from feature_probe import load_pretrained_sae, decompose_trace
from concept_map import build_map, render_html
LAYER = 6
FEATURE_ID = 1234
sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.hook_resid_pre",
)
feature_vectors = []
prompts = []
labels = []
for meta_path in sorted(Path("./traces").glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    feature_vectors.append(fv.features.mean(axis=0))
    prompts.append(fv.prompt)
    # Add the target feature's activation as a synthetic label
    lbl = dict(fv.labels)
    # Peak activation of the feature over all tokens in the prompt
    activation = float(fv.features[:, FEATURE_ID].max())
    lbl[f"feature_{FEATURE_ID}"] = f"{activation:.2f}"
    lbl["feature_bucket"] = (
        "off" if activation < 0.05 else
        "weak" if activation < 0.3 else
        "strong"
    )
    labels.append(lbl)
X = np.stack(feature_vectors)
cm = build_map(X, prompts, labels, method="umap", dim=2)
render_html(
    cm,
    color_by="feature_bucket",
    title=f"Feature #{FEATURE_ID} Activation on Concept Map",
    output_path=f"feature_{FEATURE_ID}_map.html",
)
The visualization now colors every point by whether feature #1234 is off, weakly active, or strongly active. If the feature is a good “injection detector,” the strong points should tightly overlap your labeled injection region. If they leak into other regions — say, some refuse prompts also fire the feature strongly — you have found the feature’s false-positive surface, mapped visually.
This is the moment where the map earns its keep. You are no longer asking “is this feature good?” as a scalar question. You are asking “where on the concept space does this feature fire?” That is a spatial question that a good map answers instantly.
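The spatial view pairs well with a small table. As a sketch, the snippet below cross-tabulates the feature_bucket label against category, so the false-positive surface also shows up as counts. The label dictionaries are made-up examples and the helper names are illustrative; with real data you would pass the labels list built in color_by_feature.py.

```python
from collections import Counter, defaultdict


def bucket_by_category(labels: list, bucket_key: str = "feature_bucket") -> dict:
    """Count how often each activation bucket occurs within each category."""
    table = defaultdict(Counter)
    for lbl in labels:
        table[lbl.get("category", "unknown")][lbl.get(bucket_key, "off")] += 1
    return table


def strong_rate(table: dict) -> dict:
    """Fraction of prompts per category where the feature fires strongly."""
    return {cat: counts["strong"] / sum(counts.values())
            for cat, counts in table.items()}


# Made-up labels, shaped like the ones color_by_feature.py produces.
labels = [
    {"category": "injection", "feature_bucket": "strong"},
    {"category": "injection", "feature_bucket": "strong"},
    {"category": "injection", "feature_bucket": "weak"},
    {"category": "refuse", "feature_bucket": "strong"},
    {"category": "refuse", "feature_bucket": "off"},
    {"category": "prose", "feature_bucket": "off"},
    {"category": "prose", "feature_bucket": "off"},
]

rates = strong_rate(bucket_by_category(labels))
for cat, rate in sorted(rates.items()):
    print(f"{cat:<10} strong in {rate:.0%} of prompts")
```

A high strong rate outside the injection category is the same leak you would see on the map, expressed as a number you can track between corpus versions.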
3D and Trajectories
For talks, papers, or the moment you first show a stakeholder what you have been doing, a 3D map is dramatic. It also captures more structure than 2D when the data actually has it.
# feature_map_3d.py
from concept_map import build_map, render_html
# ... same feature vector loading as before ...
cm = build_map(X, prompts, labels, method="umap", dim=3)
render_html(cm, color_by="category", output_path="feature_map_3d.html")
Open it and rotate it. Categories that sat on top of each other in 2D often separate cleanly along the third axis. You will also see the shape of the concept space: sometimes a tight ball, sometimes a curved manifold, sometimes multiple disconnected islands. Treat that shape as a hypothesis about the model rather than a fact. UMAP can stretch, compress, or split structure, so check that the shape persists across seeds and n_neighbors settings before reading meaning into it.
For an even richer view, we can plot the trajectory of a single prompt as it moves through the layers. Save as layer_trajectory.py.
# layer_trajectory.py
import numpy as np
from pathlib import Path
import plotly.graph_objects as go
from prompt_fingerprint import fingerprint_directory
fingerprints = fingerprint_directory("./traces")
# Stack every (prompt, layer) into a single big point cloud, then project.
all_vectors = []
row_labels = []
for fp in fingerprints:
    for layer in range(fp.n_layers):
        all_vectors.append(fp.layer(layer))
        row_labels.append({
            "prompt": fp.prompt,
            "category": fp.labels.get("category", "?"),
            "layer": layer,
        })
X = np.stack(all_vectors)
import umap
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
fig = go.Figure()
# One trace per prompt, drawing the layer-by-layer path through UMAP space.
n_layers = fingerprints[0].n_layers
for i, fp in enumerate(fingerprints):
    # Rows were appended prompt by prompt, so prompt i owns a contiguous block.
    # This assumes every fingerprint has the same number of layers, and it
    # keeps duplicate prompt strings on separate trajectories.
    idx = list(range(i * n_layers, (i + 1) * n_layers))
    xs = coords[idx, 0]
    ys = coords[idx, 1]
    fig.add_trace(go.Scatter(
        x=xs, y=ys,
        mode="lines+markers",
        name=f"{row_labels[idx[0]]['category']}: {fp.prompt[:30]}",
        line=dict(width=1),
        marker=dict(
            # Markers grow with depth so the direction of travel is visible
            size=[4 + 8 * (layer / n_layers) for layer in range(n_layers)],
            opacity=0.7,
        ),
        text=[f"Layer {layer}" for layer in range(n_layers)],
        hoverinfo="text+name",
    ))
fig.update_layout(
    title="Layer-by-Layer Trajectory of Prompts Through Concept Space",
    template="plotly_dark",
    showlegend=False,
)
fig.write_html("trajectories.html")
Each prompt now appears as a curve rather than a point: a path through concept space from layer 0 to layer 11. Prompts in the same category often follow parallel trajectories. When two prompts converge suddenly at a specific layer, you have found evidence that that layer is where the model “decides” what category the prompt belongs to. When they diverge unexpectedly late, you have found a layer where the model treats surface-similar prompts differently. Because the curves live in a 2D projection, confirm any apparent convergence or divergence against distances in the original activation space before drawing conclusions.
This is the most information-dense visualization in the series, and it is worth staring at.
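A convergence you spot on the plot can be checked against the original activations. The sketch below compares two prompts layer by layer using cosine distance and reports the layer with the largest drop. The arrays are synthetic stand-ins shaped [n_layers, d_model], built so the two prompts become similar from layer 6 onward; the function names are illustrative and not part of prompt_fingerprint.

```python
import numpy as np


def layerwise_cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a, b: [n_layers, d_model] arrays. Returns one distance per layer."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den


def largest_drop_layer(distances: np.ndarray) -> int:
    """Layer whose distance fell the most relative to the previous layer."""
    drops = distances[:-1] - distances[1:]
    return int(np.argmax(drops)) + 1


# Synthetic trajectories: unrelated for layers 0-5, nearly identical after.
rng = np.random.default_rng(0)
n_layers, d_model = 12, 256
a = rng.normal(size=(n_layers, d_model))
b = rng.normal(size=(n_layers, d_model))
shared = rng.normal(size=(n_layers - 6, d_model))
a[6:] = shared + 0.05 * rng.normal(size=shared.shape)
b[6:] = shared + 0.05 * rng.normal(size=shared.shape)

dist = layerwise_cosine_distance(a, b)
for layer, value in enumerate(dist):
    print(f"layer {layer:2d}: cosine distance {value:.3f}")
print(f"largest drop at layer {largest_drop_layer(dist)}")
```

If the plot shows two curves merging at a layer where this distance barely moves, the merge is more likely a projection effect than a property of the model.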
The Security Analyst’s Dashboard
We now have enough tooling to build a working analyst dashboard. Save this as dashboard.py.
# dashboard.py
"""Generate a small suite of HTML maps for a corpus in one go."""
import numpy as np
from pathlib import Path
from prompt_fingerprint import fingerprint_directory
from feature_probe import load_pretrained_sae, decompose_trace
from concept_map import build_map, render_html
TRACES = Path("./traces")
LAYER = 6
OUT = Path("./dashboard")
OUT.mkdir(exist_ok=True)
# 1. Raw fingerprint map (2D)
fingerprints = fingerprint_directory(TRACES)
X = np.stack([fp.layer(LAYER) for fp in fingerprints])
prompts = [fp.prompt for fp in fingerprints]
labels = [fp.labels for fp in fingerprints]
cm = build_map(X, prompts, labels, method="umap", dim=2)
render_html(cm, color_by="category", output_path=OUT / "01_fingerprint.html",
            title=f"Raw Fingerprint Map (Layer {LAYER})")
# 2. Feature map (2D)
sae, _ = load_pretrained_sae(
    release="gpt2-small-res-jb",
    sae_id=f"blocks.{LAYER}.hook_resid_pre",
)
feature_vectors, feat_prompts, feat_labels = [], [], []
for meta_path in sorted(TRACES.glob("*.json")):
    fv = decompose_trace(meta_path, sae, layer=LAYER)
    feature_vectors.append(fv.features.mean(axis=0))
    feat_prompts.append(fv.prompt)
    feat_labels.append(fv.labels)
X_feat = np.stack(feature_vectors)
cm_feat = build_map(X_feat, feat_prompts, feat_labels, method="umap", dim=2)
render_html(cm_feat, color_by="category",
            output_path=OUT / "02_features.html",
            title=f"Feature Map (Layer {LAYER})")
# 3. Feature map (3D)
cm_feat_3d = build_map(X_feat, feat_prompts, feat_labels, method="umap", dim=3)
render_html(cm_feat_3d, color_by="category",
            output_path=OUT / "03_features_3d.html",
            title=f"Feature Map 3D (Layer {LAYER})")
# 4. PCA comparison (fast, deterministic; a sanity check for the UMAP)
cm_pca = build_map(X_feat, feat_prompts, feat_labels, method="pca", dim=2)
render_html(cm_pca, color_by="category",
            output_path=OUT / "04_features_pca.html",
            title=f"Feature PCA (Layer {LAYER})")
print(f"\nDashboard written to {OUT.resolve()}")
print("Open the HTML files in a browser to explore.")
Four maps, one command. The 03_features_3d.html file is the one I keep open on a second monitor when I am investigating a new corpus. The 04_features_pca.html file is a sanity check — if PCA and UMAP both show similar cluster structure, the finding is robust. If they disagree wildly, one of them is telling you something the other is hiding, and you need to think carefully about which is more faithful for your question.
What the Map Cannot Show
Every visualization comes with an obligation to be honest about what it hides.
Distances between clusters are not always meaningful. UMAP tries to preserve them; t-SNE actively distorts them. If two clusters look far apart in a UMAP plot, that is some evidence of structural distance in the original space, but it is not proof.
Randomness matters. UMAP and t-SNE both involve randomness, in initialization, in optimization, or in both. Always set random_state so a map is reproducible, then rerun with a few different seeds. If your findings change materially from seed to seed, your findings are noise.
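High-dimensional distances are strange. As dimensionality grows, pairwise distances between random vectors concentrate: in a 9,216-dimensional space, nearly all pairs end up roughly the same distance apart. This is one face of the “curse of dimensionality,” and it weakens the nearest-neighbor signal that UMAP and t-SNE rely on. Scale is a separate problem. When activations have very different magnitudes, the largest ones dominate every distance, so normalize before projecting. The scripts in this article project raw vectors, so add that step if your corpus needs it.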
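Both effects are easy to see numerically. The sketch below uses random Gaussian vectors, not model activations, to show how the relative spread of pairwise distances shrinks as dimension grows, and it includes a simple L2 normalization helper you could apply before projecting. The function names are illustrative.

```python
import numpy as np


def relative_spread(X: np.ndarray) -> float:
    """(max - min) / mean over all pairwise Euclidean distances."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0, None)
    upper = np.triu_indices(len(X), k=1)   # each pair once, no self-distances
    d = np.sqrt(d2[upper])
    return float((d.max() - d.min()) / d.mean())


def l2_normalize(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale every row to unit length so magnitude does not dominate distance."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


rng = np.random.default_rng(0)
spreads = {}
for dim in (2, 32, 768):
    X = rng.normal(size=(200, dim))
    spreads[dim] = relative_spread(X)
    print(f"dim={dim:4d}  relative spread of distances: {spreads[dim]:.2f}")

# Rows with wildly different magnitudes end up on the unit sphere.
mixed = rng.normal(size=(5, 768)) * np.array([[1.0], [10.0], [100.0], [0.1], [5.0]])
print(np.linalg.norm(l2_normalize(mixed), axis=1))
```

L2 normalization is one option among several. Per-dimension standardization is another, and which one suits your activations is a judgment call that depends on whether magnitude carries signal you want to keep.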
A map is not a mechanism. Even a beautiful, clean, well-separated map does not prove the model uses the features you are visualizing to produce its outputs. That is a causal claim, and we have not made it yet. We will make it in Part 10.
Homework: Anomaly Hunt
Before Part 10:
- Capture a “known-good” corpus of 50-100 prompts your model handles routinely without issue.
- Capture a small “unknown” set — 5-10 prompts you have not labeled and are genuinely uncertain about.
- Build a feature map of the union, coloring the known-good prompts by their category and marking the unknown prompts with a fifth color.
- Look at where the unknown prompts land. Are they inside a known category? Between categories? In a region with no known-good neighbors?
The exercise trains an intuition that no equation can teach: what “anomalous” looks like on a concept map. Every good security engineer builds this intuition. Now you are building it for AI systems.
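Once you have looked, you can compare your eye against a simple score. The sketch below is one possible approach, not the only one: it rates each unknown prompt by its mean distance to its k nearest known-good neighbors and flags it when that distance exceeds what is typical inside the known-good set. The vectors are synthetic stand-ins, and the function names, k, and the 0.95 quantile are illustrative choices.

```python
import numpy as np


def knn_mean_distance(queries: np.ndarray, reference: np.ndarray, k: int,
                      exclude_self: bool = False) -> np.ndarray:
    """Mean Euclidean distance from each query to its k nearest reference rows."""
    sq_q = (queries ** 2).sum(axis=1)
    sq_r = (reference ** 2).sum(axis=1)
    d2 = sq_q[:, None] + sq_r[None, :] - 2.0 * (queries @ reference.T)
    dist = np.sqrt(np.clip(d2, 0.0, None))
    if exclude_self:                      # only valid when queries is reference
        np.fill_diagonal(dist, np.inf)
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)


def flag_anomalies(known: np.ndarray, unknown: np.ndarray,
                   k: int = 5, quantile: float = 0.95):
    """Score unknown points against the known-good neighbor-distance baseline."""
    baseline = knn_mean_distance(known, known, k, exclude_self=True)
    threshold = np.quantile(baseline, quantile)
    scores = knn_mean_distance(unknown, known, k)
    return scores, scores > threshold


# Synthetic stand-ins: 80 known-good vectors, one unknown in the middle of
# them, and one unknown far away from all of them.
rng = np.random.default_rng(0)
known = rng.normal(size=(80, 8))
unknown = np.vstack([np.zeros(8), np.full(8, 6.0)])

scores, flags = flag_anomalies(known, unknown)
for score, flag in zip(scores, flags):
    print(f"score={score:.2f}  anomalous={bool(flag)}")
```

Run the score on the feature vectors, not the 2D coordinates, and then check whether the flagged prompts are the ones that looked isolated on the map.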
Where We Stand and What’s Ahead
Nine articles in:
- Part 1: The language — tensors, ranks, shapes
- Part 2: The architecture — embeddings, attention, transformers
- Part 3: The threat landscape — input, weight, output attacks
- Part 4: The interpretability toolbox — SAEs, circuits, patching, probing
- Part 5: The workbench — PyTorch, TransformerLens, first experiments
- Part 6: The instrument — a reusable activation logger
- Part 7: The first analysis — fingerprinting prompts by their internal footprint
- Part 8: The upgrade — decomposing tangled activations into interpretable features
- Part 9: The atlas — turning feature vectors into navigable visual maps
We have four tools now. activation_logger captures. prompt_fingerprint compares. feature_probe interprets. concept_map reveals. The system is more than half built.
In Part 10 — Finding the Edit Points: Causal Tracing at the Tensor Level — we finally get to causality. Every finding so far has been correlational: features fire when certain prompts arrive, clusters form on certain maps. But to build a defense, we need to know which activations cause a given behavior. We will build a causal tracing tool that patches activations from one prompt into another and localizes exactly which layer, token, and dimension is responsible for the difference in output.
The map shows us where the concepts live. Causal tracing shows us which parts of the map we can actually reach in and change.
References
- Coenen, A., et al. (2019). Visualizing and Measuring the Geometry of BERT. NeurIPS.
- Kobak, D., & Berens, P. (2019). The Art of Using t-SNE for Single-Cell Transcriptomics. Nature Communications, 10(1), 1-14.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
- Plotly Technologies Inc. (2015). Collaborative Data Science. Plotly Technical Documentation.
- van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(86), 2579-2605.
Join the Mission
This is just the beginning. I will be sharing my code, data, and research findings as I go. If you are interested in the intersection of AI, Quantum, and Security, I’d love to connect.
- GitHub: github.com/bitghostsecurity
- Collaborate: hello@bitghostsecurity.com
Hardened Logic for an Intelligent Era.