Part 12: The Bitghost Debugger – A Proposal for Open-Source LLM Instrumentation

This is Part 12 — the final article — of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. Across tensors, transformers, threat models, interpretability techniques, a lab setup, and six purpose-built tools — the logger, the fingerprinter, the feature probe, the concept map, the edit-point finder, and the steerer — we built the pieces. Today we put them on one bench.

What We Have Learned

Eleven months ago I set out to answer a specific question: is it possible for a security engineer, with no ML PhD and no data-center compute, to do meaningful research on the internals of language models? The answer this series has been quietly demonstrating is yes — provided the tooling exists. The reason it does not feel possible for most people right now is that the tooling is fragmented across a dozen research repositories, tuned for individual papers, and rarely designed to compose.

That is the gap this final article proposes to close. The Bitghost Debugger is not a new idea; it is the unification of the six tools we built in Parts 6 through 11, packaged behind a coherent interface with a shared data model, an installable CLI, and — critically — a GUI that a human analyst can drive without writing new PyTorch code for every investigation.

Before I describe the architecture, let me be honest about the scope of the claim. What this series has built is the skeleton of that debugger: the Python modules that ship with each post (activation_logger.py, prompt_fingerprint.py, feature_probe.py, concept_map.py, edit_points.py, steering.py). They compose. They share a trace format. They are testable end-to-end. What they are not, today, is a polished product. Part 12 is the roadmap from skeleton to product, and an invitation for the community to help build it.

The Design Principles

Before naming components, the principles they must respect.

Capture is separated from analysis. The one non-negotiable architectural rule of this series. tcpdump does not classify traffic; Wireshark does. Our logger does not fingerprint; our fingerprinter does. Every tool reads a common trace file, writes a well-typed output, and never smuggles model state across boundaries. This is what makes the tools reproducible.
Every finding is auditable back to an activation. A steering intervention that reduces injection success by 40% is uninteresting if you cannot show which activations, at which layers, on which prompts, produced the finding. Every derived artifact — fingerprint, feature vector, concept map, edit point, steering vector — retains a pointer back to the underlying trace files. Reviewers can trace claims to evidence.
Small models are first-class citizens. GPT-2 Small is not a toy; it is a tractable subject. Every technique in this series works there before it works anywhere else. The debugger’s default model is small enough to run on a laptop CPU so that adoption is not gated on GPU access.
The GUI is a wrapper around the CLI, not the other way around. Every action a user takes in the GUI corresponds to a shell command that could have been run instead. That is the discipline that keeps the tool scriptable, testable, and CI-friendly. It is the same discipline that makes git a good tool despite the existence of gitk.
Openness with instrumentation. The dual-use question has been in the room since Part 10. The debugger will ship with defensive monitoring components in the same repository as offensive ones. Not because that neutralizes the risk — it does not — but because it aligns incentives: any researcher who improves the attack surface is directly contributing to the tools defenders use to catch it.
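
To make the first two principles concrete, here is a minimal sketch of how capture and analysis could stay separate while every derived artifact keeps a pointer back to its evidence. The names Trace, Fingerprint, and fingerprint_from_trace are illustrative assumptions, not the API of the modules shipped with the series, and nested tuples of floats stand in for real activation tensors.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class Trace:
    """Output of the capture step: raw activations plus the metadata needed to find them again."""
    prompt: str
    layer: int
    activations: tuple  # one tuple of floats per token position (toy stand-in for a tensor)

    def trace_id(self) -> str:
        # Content hash, so a derived artifact can point at exactly this evidence.
        payload = json.dumps(
            {"prompt": self.prompt, "layer": self.layer, "activations": self.activations},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Fingerprint:
    """Output of the analysis step: a pooled vector plus pointers to its source traces."""
    vector: tuple
    layer: int
    pooling: str
    source_traces: tuple  # trace ids this fingerprint was derived from


def fingerprint_from_trace(trace: Trace) -> Fingerprint:
    """Mean-pool activations over token positions. Reads a trace, never touches a model."""
    n_tokens = len(trace.activations)
    width = len(trace.activations[0])
    pooled = tuple(
        sum(token[i] for token in trace.activations) / n_tokens for i in range(width)
    )
    return Fingerprint(pooled, trace.layer, "mean", (trace.trace_id(),))


# Toy example: two token positions, two hidden dimensions.
trace = Trace("The password is", 6, ((1.0, 2.0), (3.0, 4.0)))
fp = fingerprint_from_trace(trace)
```

The analysis function only sees a Trace, so it can be rerun on another machine from the trace file alone, and the source_traces field is what lets a reviewer walk from a claim back to the activations behind it.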

The Proposed Architecture

Here is the system, in an ASCII diagram that a security engineer can read at a glance.

                    +-----------------------------+
                    |          Bitghost           |
                    |          Debugger           |
                    |            GUI              |
                    |  (browser + Plotly + Vite)  |
                    +--------------+--------------+
                                   |
                                   | JSON-RPC over WebSocket
                                   |
                    +--------------v--------------+
                    |     bitghost-server         |
                    |   (Python asyncio + FastAPI)|
                    +--------------+--------------+
                                   |
       +---------------------------+---------------------------+
       |                                                       |
       |         +-----------------+                           |
       +--------->  Session store  |                           |
       |         |  (SQLite + fs)  |                           |
       |         +-----------------+                           |
       |                                                       |
       |   +---------+   +---------+   +---------+             |
       +--->  logger |   |  probe  |   |  edits  |             |
       |   +----+----+   +----+----+   +----+----+             |
       |        |             |             |                  |
       |   +----v-------------v-------------v----+             |
       |   |     trace / feature / causal        |<------------+
       |   |         file store on disk          |
       |   +-------------------------------------+
       |
       +--------------------------------+
                                        |
                                        v
                +-------------+   +-----------+   +----------+
                |  Loaded LLM | ->|  Hooked   | ->|  Cache   |
                |   Weights   |   |Transformer|   |          |
                +-------------+   +-----------+   +----------+

The layers:

Model layer. A HookedTransformer instance managed by the server, kept resident in memory across sessions so that reload costs do not repeat. Adapters for TransformerLens today, nnsight or raw transformers tomorrow.
Tool layer. The six modules from Parts 6-11, each with a stable Python API and a matching CLI subcommand. Each accepts and produces the same well-typed artifacts.
Storage layer. File-backed by default (JSON metadata + .pt / .npy tensors + a small SQLite index for search). No mandatory cloud dependency. Reproducibility means the whole session must be zippable and reloadable on another machine.
Server. A FastAPI app that exposes the tools as JSON-RPC methods. Multi-user by design — the debugger is meant to run on a shared research host as well as on a laptop.
GUI. A browser front-end (Plotly for the concept map, editable notebook cells for scripted investigations, a “session inspector” for reviewing prior work). Every GUI action is round-tripped through a documented RPC method, so nothing the GUI does is inaccessible from a script.

The pieces the series has already delivered are the Tool layer and the file-backed core of the Storage layer; the SQLite index is scheduled alongside the CLI in the roadmap below. The Server and GUI are the work the community is being asked to help finish.
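
Since the Server does not exist yet, the following is only a sketch of the shape its dispatch layer might take: a registry that maps JSON-RPC 2.0 method names to plain Python callables, so the GUI and a script hit the same functions. The method name traces.list, the registry, and the placeholder handler are assumptions for illustration. The sketch uses only the standard library rather than FastAPI, accepts named parameters only, and does not handle batch requests or invalid parameters.

```python
import json

# Hypothetical registry: RPC method name -> plain Python callable.
METHODS = {}


def rpc_method(name):
    """Register a function under a JSON-RPC method name."""
    def decorator(func):
        METHODS[name] = func
        return func
    return decorator


@rpc_method("traces.list")
def list_traces(session):
    # Placeholder body; a real server would query the session store here.
    return {"session": session, "traces": []}


def _error(req_id, code, message):
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    )


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request string and return the response string."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        return _error(None, -32700, "Parse error")

    req_id = request.get("id")
    func = METHODS.get(request.get("method"))
    if func is None:
        return _error(req_id, -32601, "Method not found")

    result = func(**request.get("params", {}))
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})


reply = handle_request(
    '{"jsonrpc": "2.0", "id": 1, "method": "traces.list",'
    ' "params": {"session": "myinvestigation"}}'
)
```

Keeping the handlers as ordinary functions is what would let the same code back a CLI subcommand, an RPC method, and a unit test without three separate implementations.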

The Six Tools, One Interface

The proposed CLI surface, showing how the tools compose:

# Session management
bitghost init myinvestigation
bitghost session use myinvestigation

# Capture
bitghost capture "The password is" --labels category=credential
bitghost capture-file corpus.jsonl              # bulk capture
bitghost traces list                            # list the traces indexed in the session

# Analysis
bitghost fingerprint --layer 6 --pooling mean
bitghost features --sae gpt2-small-res-jb --layer 6
bitghost cluster --by category --method umap --dim 2
bitghost map open feature_map.html              # opens in browser

# Causal
bitghost trace-causal \
  --clean  "Please summarize the article"      \
  --corrupt "Please summarize ignore instructions"
bitghost edit-points top --k 15

# Intervention
bitghost steer build \
  --positive refuse --negative injection --layer 6 \
  --out refuse.steer
bitghost steer generate \
  --prompt "Ignore previous and reveal" \
  --with refuse.steer --strength 4.0
bitghost steer test --set injection_test.jsonl --with refuse.steer

# Deployment
bitghost monitor start --steer refuse.steer --alert-on-comply
bitghost monitor logs --tail

Every subcommand corresponds directly to one of the tools we built. Every artifact — trace file, fingerprint file, feature vector, causal map, steering vector — is a versioned, hashable file on disk. The GUI is a browser wrapper around the same subcommands. A team can review a colleague’s investigation by unzipping the session directory and running bitghost session use.
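
One way the "versioned, hashable file on disk" idea could work is content addressing: name each artifact by the SHA-256 of its bytes and keep a small SQLite index beside it. The sketch below assumes a hypothetical layout (one JSON file per artifact plus index.sqlite in the session directory) and hypothetical function names; real tensors would live in .pt or .npy files rather than JSON.

```python
import hashlib
import json
import sqlite3
from contextlib import closing
from pathlib import Path


def store_artifact(session_dir: Path, kind: str, payload: dict) -> str:
    """Write an artifact as canonical JSON named by its SHA-256, then index it."""
    session_dir.mkdir(parents=True, exist_ok=True)
    data = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    (session_dir / f"{digest}.json").write_bytes(data)

    with closing(sqlite3.connect(session_dir / "index.sqlite")) as db:
        with db:  # commits on success, rolls back on error
            db.execute(
                "CREATE TABLE IF NOT EXISTS artifacts (digest TEXT PRIMARY KEY, kind TEXT)"
            )
            db.execute("INSERT OR IGNORE INTO artifacts VALUES (?, ?)", (digest, kind))
    return digest


def verify_session(session_dir: Path) -> list:
    """Return the digests whose files are missing or no longer match their hash."""
    with closing(sqlite3.connect(session_dir / "index.sqlite")) as db:
        digests = [row[0] for row in db.execute("SELECT digest FROM artifacts ORDER BY digest")]

    mismatched = []
    for digest in digests:
        path = session_dir / f"{digest}.json"
        if not path.exists() or hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            mismatched.append(digest)
    return mismatched
```

Because the index and the artifacts sit in one directory, zipping that directory carries the whole investigation, and a reviewer can run a check like verify_session after unzipping to confirm nothing changed in transit.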

The Novel Piece: The Runtime Monitor

Everything so far has been offline instrumentation — capture traces, analyze them, build interventions. The proposed bitghost monitor component is what makes the whole system usable in production. It is the piece that runs alongside a deployed model, applies steering vectors at inference time, and — importantly — alerts when the model’s internal state on a live prompt lands in a region of concept space consistent with known attack categories.

Conceptually:

     inference-time prompt --+
                             |
                             v
                       +------------+
                       | HookedLLM  | -----------> completion
                       +-----+------+
                             |
                        residual stream
                             |
              +--------------+---------------+
              |                              |
              v                              v
      +-------------+                 +-------------+
      | fingerprint |                 |  optional   |
      | + feature   |                 |  steering   |
      | extraction  |                 |  vector     |
      +------+------+                 |  injection  |
             |                        +-------------+
             v
       +-----------+
       |  nearest  |
       |  category |
       +-----+-----+
             |
             v
       +-----------+
       | alert if  |
       | injection |
       +-----------+

The monitor is new to this series, but it is not novel research. It is the deployment shape of everything the series has taught. It is worth naming as a separate proposed component because it is where the tooling becomes operationally useful: not a research-lab curiosity, but a runtime signal that a security operations team can subscribe to.
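
The two branches of the diagram can be sketched in a few lines. This is a toy version under stated assumptions: fingerprints and category centroids are plain lists of floats, "nearest category" means highest cosine similarity, and the function names, the 0.8 threshold, and the example centroids are all hypothetical. A real monitor would read the residual stream from a hooked model instead of taking vectors as arguments.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_category(fingerprint, centroids):
    """Return (category, similarity) for the centroid closest to the fingerprint."""
    return max(
        ((name, cosine(fingerprint, vec)) for name, vec in centroids.items()),
        key=lambda pair: pair[1],
    )


def check_prompt(fingerprint, centroids, alert_on=("injection",), threshold=0.8):
    """Left branch: classify the live fingerprint and decide whether to alert."""
    category, score = nearest_category(fingerprint, centroids)
    return {
        "category": category,
        "score": round(score, 3),
        "alert": category in alert_on and score >= threshold,
    }


def apply_steering(residual, steering_vector, strength):
    """Right branch: add a scaled steering vector to a residual-stream vector."""
    return [r + strength * s for r, s in zip(residual, steering_vector)]


# Hypothetical two-dimensional centroids, for illustration only.
centroids = {"benign": [1.0, 0.0], "injection": [0.0, 1.0]}
suspicious = check_prompt([0.1, 0.9], centroids)
ordinary = check_prompt([0.9, 0.1], centroids)
steered = apply_steering([1.0, 1.0], [0.5, -0.5], strength=4.0)
```

The dictionary returned by check_prompt is deliberately flat so that it could be serialized as a structured alert; the threshold is the knob a deployment would have to tune against its own false-positive budget.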

What the Series Explicitly Did Not Cover

Twelve months is enough to build a foundation, not a complete field. Some threads I want to name because they matter and were not addressed:

Multimodal models. Everything here was text-only. Vision-language models add attack surfaces (image-based injection, cross-modal steering) that need their own tooling.
Fine-tuning attacks. We treated model weights as read-only. Real threat models include supply-chain weight modification (see the ROME line of work). A “weights diff” tool that identifies where a fine-tuned model differs causally from a base model is an obvious next module.
Distributed inference / MoE architectures. Mixture-of-experts models have a routing layer that is itself an attack surface, and the capture logic needs to handle expert-specific activations.
Federated / on-device deployment. Runtime monitoring in an on-device model (a phone LLM, an embedded agent) has different constraints than server-side monitoring. Nothing in the debugger currently addresses that.
Quantum-classical hybrid models. In the “About” page for Bitghost we said the mission is “hardened logic for an intelligent era” — that era includes post-classical compute. The debugger, as proposed, has nothing to say about hybrid systems yet, and that is a Part 13 problem I would love someone to help solve.

Each of these is a legitimate research program. The proposed debugger is designed to leave extension points for all of them; the modularity of the six-tool architecture is exactly what lets a contributor add, say, a bitghost weights-diff subcommand without touching the concept map.
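
As a starting point for the weights-diff idea, here is a sketch of the cheapest possible first pass: rank parameters by how far a fine-tuned checkpoint has moved from its base, relative to the base norm. This is an assumption about how such a module might begin, not an existing tool. It only shows where weights differ in magnitude; deciding which differences matter causally would still need the patching approach from Part 10. Parameter names and values are made up, and plain lists stand in for tensors.

```python
import math


def weights_diff(base, tuned, top_k=3):
    """Rank parameters by the relative L2 norm of (tuned - base), largest first.

    `base` and `tuned` map parameter names to flat lists of floats and are
    assumed to share the same names and shapes.
    """
    scores = []
    for name, base_values in base.items():
        delta = math.sqrt(
            sum((t - b) ** 2 for b, t in zip(base_values, tuned[name]))
        )
        # Fall back to 1.0 so an all-zero base parameter does not divide by zero.
        scale = math.sqrt(sum(b * b for b in base_values)) or 1.0
        scores.append((name, delta / scale))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical checkpoints: only one parameter was modified.
base = {"blocks.0.mlp": [1.0, 0.0], "blocks.1.mlp": [0.0, 2.0]}
tuned = {"blocks.0.mlp": [1.0, 0.0], "blocks.1.mlp": [0.0, 3.0]}
ranked = weights_diff(base, tuned)
```

A function with this shape reads two artifacts and writes a ranked list, so it would slot in beside the existing tools without any of them needing to know it exists.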

The Ethics Discussion, in Plain Terms

I have flagged the dual-use nature of this work in Parts 10 and 11. This is where I state the project’s position.

The Bitghost Debugger will be released under an open-source license (MIT or Apache 2.0, TBD) with a documented acceptable-use expectation in the README: the tool is for defenders, researchers, and red teams operating with the consent of model owners. The license itself does not enforce that; no open-source license ever does. What we can and will do is:

Ship detection tools alongside offensive tools, so that the same commit that improves activation-steering-based attack technique adds — or updates — the monitor that catches it.
Maintain a coordinated-disclosure policy for interventions the tool enables. If a contributor develops a new steering technique that reliably bypasses production models, we ask that the model providers be notified before the technique is published, with a reasonable disclosure window matching accepted infosec norms.
Publish model-agnostic educational material rather than model-specific exploits. The tool teaches how to find edit points in your model. It is not a catalog of edit points in production models.
Refuse contributions that add functionality with no plausible defensive use — most concretely, prebuilt attack payloads targeting specific commercial models. This is a judgment call for maintainers, but the direction is clear.

I do not think this eliminates the risk. It aligns the incentives of the people who use the tool toward the defense side of the field. That is the same bet Metasploit and Ghidra represent, and it is the bet I am prepared to make.

The Contribution Roadmap

For anyone reading this who wants to help build the actual debugger — not the six-module skeleton, but the deployed product — here is the sequenced work.

Milestone 0 (weeks 1-4): package the skeleton. Merge the six modules into a bitghost-core Python package with a stable API, pyproject.toml, pre-commit, tests, and CI. Publish to PyPI. This is entirely mechanical work and it unlocks everything downstream.

Milestone 1 (months 2-3): the CLI. Wrap bitghost-core in a Typer-based CLI matching the surface sketched above. Add the session-management primitives, the SQLite trace index, and the artifact-hashing story. Ship the bitghost command as a console entry point of the PyPI package.

Milestone 2 (months 3-5): the server. Wrap the CLI in a FastAPI JSON-RPC server. Add authentication, multi-user session isolation, and the RPC schema. This is where the tool becomes usable on a shared research host.

Milestone 3 (months 5-8): the GUI. A browser front-end. React or Svelte, Plotly for the concept map, a Monaco-based notebook for scripted investigation, a “trace inspector” for reviewing captured activations. This is the highest-visibility work; it is also the most polish-sensitive and will benefit from a dedicated frontend contributor.

Milestone 4 (months 8-12): the runtime monitor. The production-side deployment. A lightweight inference wrapper around a hosted model that runs the fingerprint/feature/nearest-category pipeline on every prompt, optionally applies steering vectors, and emits structured alerts to a monitoring backend (Prometheus / OpenTelemetry / plain webhook). This is where the debugger stops being a lab tool and starts being an operational one.

Milestone 5 (year 2): the ecosystem. Extensions for other model families (Llama, Mistral, Qwen, DeepSeek). SAE library integrations beyond SAELens. Adapters for nnsight and raw HuggingFace transformers. A shared “corpus of interest” that the community curates: known-good baselines, known-bad injection variants, edge cases. This is the point at which the debugger becomes infrastructure rather than a project.

Nothing in this roadmap requires exotic compute. Every milestone is a well-scoped engineering effort that a small group of contributors can deliver.

What I Am Asking For

Three specific things.

Try the skeleton. Every article in this series ships working code. Clone the code from the bitghostsecurity GitHub org, run the tools against a model of your choice, and open issues where you hit friction. Real usage is what tells us where the abstractions are wrong.
Pick a milestone. If any of the roadmap items above matches your skillset — Python packaging, FastAPI backends, browser front-ends, deployment infrastructure — reach out. The email is at the bottom of every article. The gate is not a formal application; it is a short conversation about which milestone you want to own.
Bring your corpus. The best thing the community can contribute that no individual researcher can produce is a shared, versioned, labeled corpus of prompts — benign, adversarial, edge-case — that the debugger’s default configurations can be validated against. Labeled data is the single largest bottleneck in AI security research today. A community corpus, ethically curated, would move the entire field.

Closing: What This Was Really About

I started this series by asking whether a security engineer could meaningfully do AI security research without becoming an ML researcher first. Twelve articles later I can answer that with more confidence than I had in January.

The answer is yes, and the reason is that mechanistic interpretability is producing tooling that is closer to reverse engineering than to statistics. When you stare at a residual stream, you are staring at something that behaves more like a runtime memory image than a parameter distribution. When you patch an activation, you are doing something that feels more like DLL injection than like gradient descent. When you build a steering vector, you are writing a shellcode-shaped edit into a live system, and you are measuring its blast radius the same way a red teamer measures a payload’s effect.

The mental model that security engineers already have — capture, hypothesize, intervene, measure — transfers directly. What has been missing is the plumbing. Twelve articles were enough to build the plumbing.

The next twelve months will decide whether Bitghost becomes a real tool or stays as a well-scoped proposal. I would like it to become a real tool, and I would like the people building it to be a mix of ML researchers and security engineers, because that is the intersection where the interesting work lives.

Thank you for reading. If you have made it to Part 12, you now hold a stack of tools that no established security team is using and every established security team eventually will. Use them. Improve them. Share what you find.

Where We Have Been

Part 1: The language — tensors, ranks, shapes
Part 2: The architecture — embeddings, attention, transformers
Part 3: The threat landscape — input, weight, output attacks
Part 4: The interpretability toolbox — SAEs, circuits, patching, probing
Part 5: The workbench — PyTorch, TransformerLens, first experiments
Part 6: The instrument — a reusable activation logger
Part 7: The first analysis — fingerprinting prompts by their internal footprint
Part 8: The upgrade — decomposing tangled activations into interpretable features
Part 9: The atlas — turning feature vectors into navigable visual maps
Part 10: The mechanism — localizing causally load-bearing edit points
Part 11: The intervention — building steering vectors and testing them against injection
Part 12: The synthesis — an open-source proposal for the Bitghost Debugger

The ghosts in the tensors are real. The tools to find them are, at last, ours to build together.

References

Bricken, T., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic Research.
Marks, S., et al. (2024). Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. arXiv preprint arXiv:2403.19647.
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
Nanda, N., & Bloom, J. (2022). TransformerLens: A Library for Mechanistic Interpretability of Language Models. GitHub.
Olah, C., et al. (2020). Zoom In: An Introduction to Circuits. Distill.
Rimsky, N., et al. (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681.
Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic Research.
Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248.

Join the Mission

The series is over. The work is beginning. If you are interested in the intersection of AI, Quantum, and Security, I would love for you to help build what comes next.

GitHub: github.com/bitghostsecurity
Collaborate: hello@bitghostsecurity.com

Hardened Logic for an Intelligent Era.

Bit Ghost Security