Part 3: The Attack Surface Within – Where Tensors Meet Vulnerabilities
This is Part 3 of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. In Part 1, we learned what tensors are. In Part 2, we traced how they flow through a transformer. Now we turn everything we have learned into a threat model.
Rethinking the Attack Surface
In traditional cybersecurity, we have a well-established methodology for analyzing attack surfaces. We identify entry points, map trust boundaries, catalog data flows, and assess each component for potential vulnerabilities. OWASP, MITRE ATT&CK, and STRIDE have given us frameworks that work brilliantly for conventional software.
But AI systems do not fit neatly into these frameworks. The “code” is generic. The “data” is the model. The “logic” is a mathematical function with billions of parameters that no human fully understands. When I first started mapping the attack surface of a large language model, I realized I needed to think differently — not just about where attacks happen, but about what dimension they operate in.
After months of research, I have come to organize AI attack surfaces into three distinct layers:
- The Input Layer — attacks on what the model receives
- The Weight Layer — attacks on what the model is
- The Output Layer — attacks on what the model produces
Let me walk through each one, building on the tensor mathematics we covered in Parts 1 and 2.
Layer 1: Input Attacks — Manipulating the Embedding Space
The input layer is where most current AI security research is focused, and for good reason — it is the most accessible attack surface. You don’t need access to model weights or training infrastructure. You just need a prompt.
Prompt Injection: Hijacking Attention
We discussed in Part 2 how the self-attention mechanism treats every token in the context window the same way, regardless of where that token came from. There is no privilege separation between system prompts, user instructions, and injected content. Prompt injection exploits this architectural reality.
Greshake et al. (2023) formalized a taxonomy for these attacks in their landmark paper on indirect prompt injection, identifying two categories:
Direct Prompt Injection: The attacker directly crafts input to override the system prompt. Techniques include:
- Instruction override: “Ignore all previous instructions and…”
- Context manipulation: Framing requests as hypothetical scenarios or roleplay
- Delimiter confusion: Using formatting markers (```, —, etc.) to create fake system prompts
Indirect Prompt Injection: The attacker places malicious instructions in content the model will later retrieve — websites, documents, emails, or database entries. When the model processes this content through its attention mechanism, the injected instructions compete with legitimate instructions for attention weight.
The mathematical reality is stark. In the attention equation:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\]

there is nothing that differentiates a “trusted” key-value pair from an “untrusted” one. The softmax function distributes attention based purely on the dot-product similarity between queries and keys. An injected instruction that produces key vectors highly aligned with the model’s query vectors will receive disproportionate attention — regardless of its source.
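To make this concrete, here is a minimal sketch of that computation in PyTorch, using toy random vectors instead of real embeddings. The function sees a single undifferentiated stack of keys and values; which rows came from the system prompt and which came from retrieved content is information it simply does not have.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Plain scaled dot-product attention, exactly as in the equation above."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # query-key similarity
    weights = F.softmax(scores, dim=-1)              # attention distribution per query
    return weights @ V, weights

# Toy example: 2 "system prompt" tokens and 2 "injected" tokens, 4-dim embeddings.
torch.manual_seed(0)
system = torch.randn(2, 4)
injected = torch.randn(2, 4)
K = V = torch.cat([system, injected], dim=0)  # one undifferentiated context
Q = torch.randn(1, 4)                         # the query for the next token

_, weights = scaled_dot_product_attention(Q, K, V)
print(weights)  # whichever keys align best with Q win attention, trusted or not
```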
Adversarial Examples: Precision Perturbations
Beyond text-based prompt injection, there is a deeper class of input attacks that operates directly on the tensor representations. Goodfellow, Shlens, and Szegedy (2015) introduced the Fast Gradient Sign Method (FGSM), demonstrating that neural networks are systematically vulnerable to small, carefully computed perturbations.
The core idea is elegant and terrifying:
```python
import torch

def fgsm_attack(model, input_tensor, target, epsilon=0.01):
    """
    Fast Gradient Sign Method: compute the direction that
    maximizes the model's loss, then take a small step.
    """
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = torch.nn.functional.cross_entropy(output, target)
    loss.backward()
    # The sign of the gradient tells us which direction to perturb
    perturbation = epsilon * input_tensor.grad.sign()
    adversarial_input = input_tensor + perturbation
    return adversarial_input
```
The perturbation is tiny — often imperceptible to humans — but it exploits the high-dimensional geometry of the model’s decision boundaries. Carlini and Wagner (2017) later showed that even more powerful attacks could be crafted by solving an optimization problem that finds the minimum perturbation needed to cause misclassification.
For text-based models, adversarial examples take different forms. Ebrahimi et al. (2018) demonstrated HotFlip, which computes character-level perturbations using gradient information to find the single character change that maximally disrupts model predictions. A single character swap (replacing one letter of “security” with a visually similar or adjacent character, say) can cascade through the embedding and attention layers with surprising effects.
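To see why a single flip matters downstream, the sketch below compares how a clean word and a one-character variant tokenize. It assumes the Hugging Face transformers library and the GPT-2 tokenizer are available, and the flipped spelling is a hypothetical example rather than one taken from the paper; this is not HotFlip itself (which chooses the flip via gradients), only the effect a chosen flip has on the token sequence the model actually sees.

```python
from transformers import AutoTokenizer

# Illustrative only: compare how a clean word and a one-character variant tokenize.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["security", "securlty"]:  # hypothetical single-character flip
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {tokens}")
# The clean word typically maps to one token; the flipped word fragments into
# several, so every downstream embedding and attention computation changes.
```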
Embedding Space Attacks
A more sophisticated class of input attacks operates directly in the embedding space. Rather than crafting adversarial text, the attacker crafts adversarial embeddings — tensor values that do not correspond to any natural text but that steer the model’s computation in attacker-controlled directions.
Zou et al. (2023) demonstrated this with their Greedy Coordinate Gradient (GCG) attack, which finds adversarial suffixes that cause aligned language models to generate harmful content. The attack works by optimizing in the token space to find sequences that produce embedding vectors pointing toward the model’s “harmful content” region — even though the adversarial suffixes look like gibberish to humans.
```python
# Example GCG adversarial suffix (these look like nonsense)
# "describing.-- reps$LT+ Nach()", appended to a harmful query,
# can cause an aligned model to comply
```
This is fundamentally a tensor-space attack. The adversarial suffix is chosen not for its semantic content but for the geometric properties of its embedding vectors relative to the model’s internal decision boundaries.
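The sketch below is a heavily simplified, toy illustration of the gradient step at the heart of GCG, not the published attack: it swaps the real LLM and its refusal loss for a stand-in objective, and it skips the candidate re-evaluation and batching the real method uses. All names and sizes here (rank_token_swaps, toy_loss, the 1,000-token vocabulary) are illustrative.

```python
import torch

def rank_token_swaps(embedding_matrix, suffix_ids, loss_from_embeddings, top_k=5):
    """
    Core idea behind Greedy Coordinate Gradient (greatly simplified):
    differentiate the loss w.r.t. a one-hot encoding of the suffix tokens,
    then use that gradient to rank candidate single-token replacements.
    """
    vocab_size = embedding_matrix.shape[0]
    one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embedding_matrix  # differentiable embedding lookup
    loss = loss_from_embeddings(suffix_embeds)
    loss.backward()
    # A strongly negative gradient at (position, token) suggests that swapping
    # that token in would reduce the loss, i.e. push the model toward compliance.
    return (-one_hot.grad).topk(top_k, dim=-1).indices  # (suffix_len, top_k)

# Toy stand-in for "distance from the model's compliance direction" in embedding space.
torch.manual_seed(0)
embedding_matrix = torch.randn(1000, 16)  # hypothetical vocab of 1,000 tokens, d=16
target_direction = torch.randn(16)

def toy_loss(suffix_embeds):
    # Lower when the mean suffix embedding points toward the target direction.
    return -(suffix_embeds.mean(dim=0) @ target_direction)

suffix_ids = torch.randint(0, 1000, (8,))  # a random 8-token suffix
print(rank_token_swaps(embedding_matrix, suffix_ids, toy_loss))
```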
Layer 2: Weight Attacks — Corrupting the Model Itself
This is the layer that I find most fascinating and most under-explored. If input attacks are like social engineering — manipulating what the model receives — weight attacks are like firmware rootkits — modifying what the model is.
Training-Time Attacks: Data Poisoning
The most well-studied weight attack is data poisoning, where an attacker corrupts the training data to embed malicious behaviors into the model’s learned weights.
Gu et al. (2019) demonstrated BadNets — neural networks with hidden backdoors. By adding a small number of poisoned examples to the training data (images with a specific pixel pattern labeled as the attacker’s target class), they produced models that behaved normally on clean inputs but responded to the backdoor trigger with attacker-chosen behavior.
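A minimal sketch of the poisoning step, assuming image tensors in channel-first (C, H, W) layout; the trigger pattern, patch location, poison rate, and target class are arbitrary illustrative choices, not the paper's exact setup.

```python
import torch

def poison_example(image, target_label, trigger_value=1.0, patch_size=3):
    """
    BadNets-style poisoning: stamp a small pixel pattern into a corner of the
    image and relabel it as the attacker's target class. A model trained on a
    mix of clean and poisoned examples learns to associate the patch with the label.
    """
    poisoned = image.clone()
    poisoned[:, -patch_size:, -patch_size:] = trigger_value  # bottom-right patch
    return poisoned, target_label

# Poison a small fraction of a (toy) dataset.
images = torch.rand(100, 3, 32, 32)  # stand-in for real training images
labels = torch.randint(0, 10, (100,))
poison_rate, target_class = 0.05, 7
for idx in torch.randperm(100)[: int(100 * poison_rate)]:
    images[idx], labels[idx] = poison_example(images[idx], target_class)
```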
The backdoor is not a separate piece of code injected into the model. It is encoded in the weight tensors themselves — specific patterns of values across specific layers that create an alternative computational pathway activated only by the trigger. Chen et al. (2017) extended this to show that backdoors survive fine-tuning, transfer learning, and even model pruning, because the backdoor patterns become deeply intertwined with the model’s legitimate knowledge.
More recently, Wan et al. (2023) demonstrated instruction-following backdoors in LLMs, where poisoned training data causes the model to follow attacker-specified instructions when a trigger phrase is present. The implications are chilling: a model could pass every standard evaluation benchmark while harboring hidden behaviors waiting to be activated.
Inference-Time Weight Manipulation
If an attacker gains access to the model’s weight files after training, they can directly modify the tensors without needing to retrain. Model weights are typically stored in standard formats:
- PyTorch: `.pt` or `.bin` files (Python pickle serialized tensors)
- TensorFlow: `.h5` or SavedModel directories
- ONNX: `.onnx` (Open Neural Network Exchange format)
- SafeTensors: `.safetensors` (Hugging Face’s secure format)
```python
import torch

# Loading and inspecting model weights is trivial
model_weights = torch.load("model.pt", map_location="cpu")

# Each key is a layer name, each value is a tensor
for name, tensor in model_weights.items():
    print(f"{name}: shape={tensor.shape}, dtype={tensor.dtype}")

# Modifying a weight is as simple as:
# model_weights['layer.0.attention.W_q'] += perturbation_tensor
# torch.save(model_weights, "model_modified.pt")
```
The PyTorch `.pt` format is particularly concerning because it uses Python’s pickle serialization, which can execute arbitrary code during deserialization. SafeTensors was developed specifically to address this — it is a pure data format with no code execution capability. But adoption has been slow, and a huge number of models on Hugging Face are still distributed in pickle-based formats.
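For comparison, here is a short sketch of the safer path, assuming the safetensors package is installed: load_file only ever parses raw tensor data, so there is no deserialization step in which code could run.

```python
import torch
from safetensors.torch import save_file, load_file

# Pickle-based loading (torch.load on a .pt file) can run arbitrary code at
# deserialization time. The safetensors path below only reads raw tensor data.
weights = {"layer.0.weight": torch.randn(4, 4), "layer.0.bias": torch.zeros(4)}
save_file(weights, "model.safetensors")

loaded = load_file("model.safetensors")  # no code execution possible here
for name, tensor in loaded.items():
    print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```

Recent PyTorch releases also accept torch.load(..., weights_only=True), which restricts unpickling to tensor data and is a reasonable mitigation when a pickle-based file cannot be avoided.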
Even with safe formats, the weights themselves can be tampered with. Hong et al. (2022) showed that modifying fewer than 0.01% of a model’s parameters — targeted at specific neurons identified through gradient analysis — could embed backdoors without affecting the model’s performance on standard benchmarks.
Supply Chain Attacks on Model Weights
The AI supply chain is alarmingly similar to the software supply chain of a decade ago — before SolarWinds made everyone take it seriously. Models are downloaded from public repositories, often without cryptographic verification. The typical workflow:
- Researcher uploads model to Hugging Face
- Developer downloads the model with `transformers.AutoModel.from_pretrained("model-name")`
- Model weights are loaded and executed
There is no code signing for model weights. There is no SBOM (Software Bill of Materials) for training data. There is no reproducible build process for most models. An attacker who compromises a popular model repository could distribute poisoned weights to thousands of downstream applications.
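Until signing and SBOMs arrive, even simple digest pinning shrinks this window. A sketch using only the standard library; the file name and the expected digest are placeholders you would fill in after reviewing a specific artifact.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file so arbitrarily large weight files can be hashed."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Pin the digest you reviewed once; refuse to load anything that does not match.
EXPECTED_SHA256 = "replace-with-the-digest-you-verified"  # placeholder
actual = sha256_of_file("model.safetensors")
if actual != EXPECTED_SHA256:
    raise RuntimeError(f"Weight file digest mismatch: {actual}")
```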
Goldblum et al. (2022) surveyed this landscape comprehensively, coining the term “dataset security” and arguing that the ML pipeline’s reliance on unverified data and models creates a systemic vulnerability comparable to the early days of open-source software distribution.
Layer 3: Output Attacks — Exploiting Model Responses
The output layer is where model vulnerabilities become user-facing. Even if the model’s weights are pristine and the input is legitimate, the output can be weaponized.
Information Extraction and Memorization
Large language models memorize portions of their training data — not approximately, but verbatim. Carlini et al. (2021) demonstrated that GPT-2 could be prompted to regurgitate exact sequences from its training data, including personally identifiable information, code snippets, and copyrighted text.
The memorization is encoded in the weight tensors, particularly in the feed-forward layers that Geva et al. (2021) identified as key-value memories. When the right query hits the right “key” neurons, the associated “value” — potentially a memorized training example — gets surfaced in the output.
```python
# Conceptual: extracting memorized content
# Prompt the model with a known prefix from the training data
prefix = "My social security number is"
# A model that memorized SSNs from its training data might complete this
# with an actual SSN from the training set
```
This is not a theoretical concern. Nasr et al. (2023) showed that ChatGPT could be induced to emit training data at scale using a simple repeated-word prompting technique, extracting megabytes of memorized text.
Hallucination as a Vulnerability
Model hallucination — generating confident but factually incorrect output — is typically discussed as a reliability problem. But from a security perspective, hallucinations are an integrity vulnerability.
When an LLM generates a response that cites non-existent research papers, recommends packages with subtly wrong names, or produces code with plausible-looking but incorrect security implementations, it is creating a trusted-source illusion. Users who trust the model’s output may act on false information.
Particularly concerning is the “package hallucination” attack surface identified by Lanyado et al. (2023). When LLMs recommend software packages that do not exist, attackers can register those package names and populate them with malicious code. The model becomes an unwitting accomplice in a supply chain attack.
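One cheap defensive habit follows directly from this: check that a recommended package actually exists, and has some release history, before installing it. Below is a sketch against PyPI’s public JSON endpoint (https://pypi.org/pypi/<name>/json); error handling is deliberately minimal, and the package names at the bottom are just examples.

```python
import json
import urllib.error
import urllib.request

def pypi_package_exists(name):
    """Return True if the package has a real PyPI entry with release history."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
            # A brand-new package with no releases deserves extra scrutiny.
            return len(data.get("releases", {})) > 0
    except urllib.error.HTTPError:
        return False

# Check an LLM-suggested dependency before blindly running `pip install`.
print(pypi_package_exists("requests"))                        # long-established -> True
print(pypi_package_exists("definitely-not-a-real-pkg-xyz"))   # likely False
```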
Model Inversion and Membership Inference
Fredrikson et al. (2015) demonstrated model inversion attacks, where an attacker uses a model’s outputs to reconstruct its training inputs. By observing a facial recognition model’s confidence scores across many queries, they could reconstruct recognizable images of individuals in the training set.
Membership inference attacks (Shokri et al., 2017) are a related threat: given a data point, can an attacker determine whether it was in the model’s training set? This has direct privacy implications — if I can determine that your medical records were used to train a diagnostic model, that constitutes a privacy breach even if I cannot recover the records themselves.
Both attacks exploit the fact that models behave subtly differently on training data versus unseen data. The weight tensors carry a statistical “signature” of the training set, and that signature leaks through the model’s outputs.
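The simplest form of membership inference is a loss-threshold test, since models tend to assign lower loss to examples they were trained on. A toy sketch, assuming access to a classifier and a labeled candidate point; in the Shokri et al. formulation the threshold would be calibrated with shadow models rather than picked by hand.

```python
import torch
import torch.nn.functional as F

def membership_score(model, x, y):
    """
    Loss-threshold membership inference: lower loss on (x, y) is weak evidence
    that the example was part of the training set.
    """
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))
        return -F.cross_entropy(logits, y.unsqueeze(0)).item()  # higher = more "member-like"

def is_probable_member(model, x, y, threshold):
    return membership_score(model, x, y) > threshold

# Toy usage with a random linear classifier; a real attack would calibrate the
# threshold on shadow models trained to mimic the target.
model = torch.nn.Linear(10, 3)
x, y = torch.randn(10), torch.tensor(1)
print(membership_score(model, x, y))
```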
Building a Unified AI Threat Model
Let me bring this together into a framework that security engineers can use:
| Attack Layer | Attack Type | Access Required | Detectability | Impact |
|---|---|---|---|---|
| Input | Prompt Injection | None (API access) | Low-Medium | Behavioral manipulation |
| Input | Adversarial Examples | None-Low | Very Low | Misclassification |
| Input | Embedding Manipulation | Model internals | Low | Arbitrary behavior |
| Weight | Data Poisoning | Training pipeline | Very Low | Persistent backdoor |
| Weight | Direct Weight Editing | Model file access | Low | Arbitrary modification |
| Weight | Supply Chain | Repository access | Very Low | Mass compromise |
| Output | Data Extraction | API access | Medium | Privacy breach |
| Output | Hallucination Exploit | None | High | Integrity compromise |
| Output | Model Inversion | API access | Low | Training data recovery |
Notice a pattern: the most dangerous attacks (weight-level) require more access but are nearly undetectable, while the most accessible attacks (prompt injection) are easier to detect but harder to prevent architecturally.
The Defender’s Dilemma
Here is what keeps me up at night as a security engineer: we are trying to secure a system whose decision-making process we do not fully understand.
In traditional software, when we find a vulnerability, we can trace the exact code path that leads to the exploit. We can write a patch that addresses the root cause. We can verify the fix with a test.
In neural networks, the “code path” is a cascade of tensor multiplications across billions of parameters. There is no line of code to patch. The “logic” is emergent, arising from the statistical patterns encoded in the weight tensors. Patching a vulnerability might mean retraining the entire model — or surgically editing specific tensor values, if we can even identify which ones.
This is why mechanistic interpretability — the subject of Part 4 — is not just an academic curiosity. It is the foundation of AI defense. You cannot defend what you cannot understand, and right now, we are defending systems that are largely opaque even to their creators.
What’s Coming Next
In Part 4, we will explore Mechanistic Interpretability — the emerging field of reverse-engineering neural networks to understand why they behave the way they do. We will learn how researchers are decomposing models into interpretable circuits, identifying the specific tensor values responsible for specific behaviors. For security engineers, this is the equivalent of learning to read disassembly — it is how we move from black-box testing to white-box analysis of AI systems.
References
- Carlini, N., & Wagner, D. (2017). Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy (S&P).
- Carlini, N., et al. (2021). Extracting Training Data from Large Language Models. USENIX Security Symposium.
- Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526.
- Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). HotFlip: White-Box Adversarial Examples for Text Classification. ACL.
- Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. ACM CCS.
- Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
- Goldblum, M., et al. (2022). Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Goodfellow, I., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR.
- Greshake, K., et al. (2023). Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec Workshop at ACM CCS.
- Gu, T., Liu, K., Dolan-Gavitt, B., & Garg, S. (2019). BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access.
- Hong, S., et al. (2022). Handcrafted Backdoors in Deep Neural Networks. NeurIPS.
- Lanyado, B., et al. (2023). Can You Trust Your Model’s Recommendations? An Analysis of Package Hallucination by LLMs. Vulcan Cyber Research.
- Nasr, M., et al. (2023). Scalable Extraction of Training Data from (Production) Language Models. arXiv preprint arXiv:2311.17035.
- Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership Inference Attacks Against Machine Learning Models. IEEE S&P.
- Wan, A., et al. (2023). Poisoning Language Models During Instruction Tuning. ICML.
- Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043.
Join the Mission
This is just the beginning. I will be sharing my code, data, and research findings as I go. If you are interested in the intersection of AI, Quantum, and Security, I’d love to connect.
- GitHub: github.com/bitghostsecurity
- Collaborate: hello@bitghostsecurity.com
Hardened Logic for an Intelligent Era.