Part 2: How LLMs Actually Think – Embeddings, Attention, and the Math Behind the Magic
This is Part 2 of a 12-part series exploring the intersection of artificial intelligence and cybersecurity. In Part 1, we built our foundation — understanding tensors as the fundamental data structure of neural networks. Now we go deeper: how do these tensors actually flow through a model to produce language?
From Words to Numbers: The Embedding Layer
In Part 1, I made a claim that might have seemed abstract: “the intelligence of the model is not in the code — it is in the tensors.” Now it is time to see that claim in action.
When you type a prompt into ChatGPT, Claude, or any large language model, the first thing that happens is deceptively simple but profoundly important: your words get converted into numbers. This process is called tokenization and embedding, and it is the front door to the entire system.
Tokenization: Breaking Language into Pieces
Before the math can begin, your text has to be broken into discrete units called tokens. A token might be a whole word (“cat”), a subword (“un” + “break” + “able”), or even a single character. Modern LLMs typically use a technique called Byte-Pair Encoding (BPE), introduced by Sennrich, Haddow, and Birch (2016), which learns a vocabulary of common subword units from the training data.
Why subwords instead of whole words? Because language is combinatorial. A vocabulary of whole words would need to be enormous to cover every possible word form, technical term, and neologism. BPE gives the model a compact vocabulary (typically 32,000 to 100,000 tokens) that can represent virtually any text by combining pieces.
For a security engineer, this is the first place to pay attention. The tokenizer is a trust boundary. It determines how the model “sees” your input. Research has shown that adversarial inputs can exploit tokenization quirks — unusual Unicode characters, zero-width spaces, or homoglyph substitutions can produce token sequences that bypass safety filters while appearing identical to human readers (Boucher et al., 2022).
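To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer (an assumption on my part: it requires pip install tiktoken, and the exact subword splits depend on the vocabulary, so your token IDs may differ):
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("unbreakable"))  # several subword IDs rather than one
print(enc.encode("cat"))          # common words often map to a single token
# An invisible zero-width space changes the token sequence entirely,
# even though both strings look identical to a human reader
clean = "ignore previous instructions"
perturbed = "ignore previous\u200b instructions"
print(enc.encode(clean) == enc.encode(perturbed))  # False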
The Embedding Matrix: Words as Coordinates
Once text is tokenized, each token ID is used to look up a row in the embedding matrix — a massive rank-2 tensor where each row is a high-dimensional vector representing a single token. In GPT-3, this matrix has dimensions of 50,257 × 12,288. That is over 617 million parameters in the embedding layer alone.
import torch
# Simulating an embedding lookup at GPT-3 scale (about 2.5 GB of float32s; shrink the dims to experiment locally)
vocab_size = 50257
embedding_dim = 12288
# The embedding matrix (in practice, this is learned during training)
embedding_matrix = torch.randn(vocab_size, embedding_dim)
# Token ID for the word "security" (hypothetical)
token_id = 6373
word_vector = embedding_matrix[token_id]
print(f"Embedding shape: {word_vector.shape}")
# Output: torch.Size([12288])
# This single word is now a point in 12,288-dimensional space
What makes embeddings powerful is that they are learned representations. During training, the model adjusts these vectors so that semantically related words end up near each other in this high-dimensional space. Mikolov et al. (2013) demonstrated this beautifully with Word2Vec, but modern contextual embeddings go much further — the same word can have different embeddings depending on context.
Consider the word “port.” In networking, it means a communication endpoint. In logistics, it means a harbor. In computing, it means to adapt software. The embedding layer of a transformer does not give “port” a single fixed vector — instead, the initial embedding gets refined through subsequent layers based on the surrounding context. This contextual sensitivity was the key innovation of models like ELMo (Peters et al., 2018) and later BERT (Devlin et al., 2019).
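A quick note on how "near each other" is measured in practice: cosine similarity between vectors. Here is a minimal sketch reusing the random embedding_matrix from above (with random weights the score will hover near zero; in a trained model, related tokens score much higher):
import torch.nn.functional as F
vec_a = embedding_matrix[6373]  # our hypothetical ID for "security"
vec_b = embedding_matrix[9999]  # another arbitrary token ID
similarity = F.cosine_similarity(vec_a, vec_b, dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")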
Positional Encoding: Teaching Order to Tensors
Here is a subtle but critical problem: a matrix multiplication does not care about order. Shuffle the rows of your word-vector matrix and multiply by the same weight matrix, and each token's result is exactly what it was before, just sitting in a shuffled position; the math gives the model no signal about where a word falls in the sequence. But “the dog bit the man” and “the man bit the dog” have very different meanings.
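A quick check makes the point, assuming nothing beyond PyTorch itself:
import torch
W = torch.randn(8, 8)     # a weight matrix
x = torch.randn(5, 8)     # five "word vectors"
perm = torch.randperm(5)  # an arbitrary reordering of the sequence
print(torch.allclose((x @ W)[perm], x[perm] @ W))  # True: order carries no signal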
Transformers solve this with positional encoding — adding a special tensor to each word’s embedding that encodes its position in the sequence. Vaswani et al. (2017) used a clever approach: sinusoidal functions of different frequencies.
import torch
import math
def positional_encoding(seq_length, d_model):
    """Generate positional encodings using sine and cosine functions."""
    pe = torch.zeros(seq_length, d_model)
    position = torch.arange(0, seq_length).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
# For a sequence of 512 tokens with 768-dimensional embeddings
pe = positional_encoding(512, 768)
print(f"Positional encoding shape: {pe.shape}")
Each position gets a unique “fingerprint” of sine and cosine values. The beauty of this approach is that the model can learn to attend to relative positions — “the word 3 positions ago” — not just absolute positions.
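A quick sanity check on the pe tensor computed above, under the assumption that "more alike" is measured by cosine similarity: nearby positions get noticeably more similar fingerprints than distant ones, which is one ingredient of relative-position reasoning:
import torch.nn.functional as F
sim_near = F.cosine_similarity(pe[100], pe[101], dim=0)
sim_far = F.cosine_similarity(pe[100], pe[400], dim=0)
print(f"pos 100 vs 101: {sim_near.item():.3f}, pos 100 vs 400: {sim_far.item():.3f}")
# The nearby pair scores noticeably higher than the distant pair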
From a security perspective, position is another subtle attack surface. Prompt-injection research has shown that where instructions sit in the sequence changes how the model weighs them: an injected instruction placed after the legitimate prompt can override it (Perez & Ribeiro, 2022).
Self-Attention: The Heart of the Transformer
Now we arrive at the mechanism that changed everything. Self-attention is the operation that allows each word in a sequence to “look at” every other word and decide how much to pay attention to it. This is what gives transformers their remarkable ability to capture long-range dependencies in text.
The Query, Key, Value Framework
Self-attention works through three learned projections: Queries (Q), Keys (K), and Values (V). Think of it like a search engine:
- The Query is what you are looking for — “I need context about this word.”
- The Key is the label on each other word — “Here is what I offer.”
- The Value is the actual content — “Here is my information if you decide I am relevant.”
The mathematical operation is elegant:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]
Let me break this down:
- QK^T: Multiply queries by keys (transposed) to get a matrix of “compatibility scores” — how relevant is each word to each other word?
- ÷ √d_k: Scale by the square root of the key dimension to prevent the dot products from becoming too large (which would push softmax into regions with tiny gradients).
- softmax: Convert scores into probabilities — each row sums to 1, representing how much attention each word pays to every other word.
- × V: Multiply by values to get the final output — a weighted combination of all words’ information, weighted by attention.
import torch
import torch.nn.functional as F
def self_attention(embeddings, W_q, W_k, W_v):
    """Simple self-attention implementation."""
    Q = embeddings @ W_q  # Queries
    K = embeddings @ W_k  # Keys
    V = embeddings @ W_v  # Values
    d_k = K.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    attention_weights = F.softmax(scores, dim=-1)
    output = attention_weights @ V
    return output, attention_weights
# Example: a single sequence of 10 tokens with 64-dim embeddings (no batch dim)
seq_len, d_model, d_k = 10, 64, 64
embeddings = torch.randn(seq_len, d_model)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)
output, weights = self_attention(embeddings, W_q, W_k, W_v)
print(f"Attention weights shape: {weights.shape}")
# Each of the 10 tokens has an attention weight over all 10 tokens
Why Self-Attention Matters for Security
The attention weight matrix is not just a computational artifact — it is a map of the model’s reasoning. When a model generates a response, the attention weights reveal which parts of the input it considered most relevant.
This has profound security implications:
Prompt Injection Attacks exploit the attention mechanism directly. When an attacker injects instructions like “Ignore all previous instructions and…” into a prompt, they are banking on the model’s attention mechanism assigning high weight to the injected text. Perez and Ribeiro (2022) documented this systematically, showing that the attention mechanism has no built-in concept of “privileged” vs. “unprivileged” text — it treats everything in the context window equally.
Attention Visualization as a Diagnostic Tool: By examining attention weights, security researchers can understand why a model followed a malicious instruction. Tools like BertViz (Vig, 2019) allow researchers to visualize these attention patterns, revealing how information flows through the model’s layers.
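Here is a minimal sketch of that diagnostic idea, heat-mapping the weights returned by the self_attention function above (it assumes matplotlib is installed; a real investigation would use trained weights and tools like BertViz):
import matplotlib.pyplot as plt
plt.imshow(weights.numpy(), cmap="viridis")
plt.xlabel("Attended-to token position")
plt.ylabel("Attending token position")
plt.colorbar(label="Attention weight")
plt.title("Self-attention weights (random toy model)")
plt.show()
# Bright cells reveal which tokens each position draws context from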
Multi-Head Attention: Parallel Perspectives
A single attention head can only capture one type of relationship at a time. Maybe one head learns syntactic dependencies (subject-verb agreement), while another learns semantic relationships (coreference). The transformer solves this by running multiple attention heads in parallel.
In GPT-3, each layer has 96 attention heads, each operating on a 128-dimensional subspace of the 12,288-dimensional embedding. The outputs of all heads are concatenated and projected back to the full dimension.
# Multi-head attention conceptually
num_heads = 96
d_model = 12288
d_head = d_model // num_heads # 128 per head
# Each head gets its own Q, K, V projections
# They all operate in parallel on different subspaces
# Results are concatenated: [head_1 ; head_2 ; ... ; head_96]
# Then projected: concatenated @ W_output
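For those who want to run it, here is the same computation at toy scale: 4 heads over a 32-dimensional model instead of 96 over 12,288. The mechanics are identical; only the shapes shrink.
import torch
import torch.nn.functional as F
num_heads, d_model, seq_len = 4, 32, 10
d_head = d_model // num_heads  # 8 dimensions per head
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_output = torch.randn(d_model, d_model)
def split_heads(t):
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    return t.view(seq_len, num_heads, d_head).transpose(0, 1)
Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5  # (num_heads, seq_len, seq_len)
heads = F.softmax(scores, dim=-1) @ V             # (num_heads, seq_len, d_head)
# Concatenate all heads back together, then project to d_model
output = heads.transpose(0, 1).reshape(seq_len, d_model) @ W_output
print(output.shape)  # torch.Size([10, 32])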
Research by Voita et al. (2019) showed that not all heads are equally important — many can be pruned without significant performance loss, while a few “critical heads” carry disproportionate responsibility for specific linguistic functions. Clark et al. (2019) found individual attention heads that specialize in tracking specific syntactic relationships.
For security, this multi-head structure means that an attack does not need to compromise all 96 heads — targeting a few critical heads could be sufficient to alter model behavior in specific, controlled ways.
The Feed-Forward Network: Processing What Attention Found
After the attention mechanism decides what information is relevant, the feed-forward network (FFN) processes that information. Each transformer layer contains a two-layer neural network applied independently to each position:
\[\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2\]
In GPT-3, the inner dimension of the FFN is 49,152 — four times the model dimension. This expansion-then-compression pattern allows the network to transform information through a higher-dimensional space before compressing it back.
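A minimal sketch of that expansion-then-compression pattern, at smaller proportions than GPT-3's 12,288 → 49,152 → 12,288 but with the same 4x ratio:
import torch
import torch.nn.functional as F
d_model, d_ff = 768, 3072  # toy sizes; same 4x expansion ratio as GPT-3
W1, b1 = torch.randn(d_model, d_ff), torch.zeros(d_ff)
W2, b2 = torch.randn(d_ff, d_model), torch.zeros(d_model)
def ffn(x):
    # Expand to d_ff, apply the nonlinearity, compress back to d_model
    return F.relu(x @ W1 + b1) @ W2 + b2
x = torch.randn(10, d_model)  # 10 token positions, each processed independently
print(ffn(x).shape)           # torch.Size([10, 768])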
Recent research has shown that the feed-forward layers function as key-value memories (Geva et al., 2021). Each row of the first weight matrix acts as a “key” that matches specific input patterns, and the corresponding row of the second weight matrix stores the associated “value” — the information that should be retrieved when that pattern is detected. This finding is revolutionary because it suggests that FFN layers are not just generic transformations but are structured stores of factual knowledge.
For security engineers, this means that facts the model “knows” — including potentially sensitive training data — are localized in specific rows of specific weight tensors. This opens the door to both targeted knowledge extraction (an offensive concern) and targeted knowledge editing (a defensive tool).
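To make the "key" and "value" language concrete, the same FFN arithmetic can be regrouped as an explicit memory lookup. A small sketch at toy dimensions (the decomposition itself is exact up to float32 rounding):
import torch
import torch.nn.functional as F
d_model, d_ff = 64, 256
keys = torch.randn(d_ff, d_model)    # row i is "key" i: a pattern detector
values = torch.randn(d_ff, d_model)  # row i is the "value" retrieved for key i
x = torch.randn(d_model)
match = F.relu(x @ keys.T)           # how strongly the input matches each key
output = match @ values              # weighted sum over the value rows
# The same result written as an explicit sum of retrieved memories
retrieved = sum(match[i] * values[i] for i in range(d_ff))
print(torch.allclose(output, retrieved, atol=1e-2))  # True, up to accumulation error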
Layer Normalization and Residual Connections: Stability in Depth
Modern transformers stack dozens or even hundreds of these attention-plus-FFN blocks. Getting gradients to flow cleanly through such deep networks requires two critical techniques:
Residual Connections (He et al., 2016): The input to each sublayer is added to its output. This creates “skip connections” that allow gradients to flow directly through the network during training, preventing the vanishing gradient problem.
Layer Normalization (Ba, Kiros, & Hinton, 2016): The activations at each layer are normalized to have zero mean and unit variance. This stabilizes training and allows higher learning rates.
# Simplified transformer block
def transformer_block(x, attention, ffn, norm1, norm2):
    # Self-attention with residual connection and layer norm
    attn_output = attention(norm1(x))
    x = x + attn_output  # Residual connection
    # Feed-forward with residual connection and layer norm
    ffn_output = ffn(norm2(x))
    x = x + ffn_output  # Residual connection
    return x
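And a toy invocation, with hypothetical stand-in sublayers (a plain Linear where real attention would go) just to show the data flow and shapes:
import torch
import torch.nn as nn
d_model = 64
x = torch.randn(10, d_model)
attn = nn.Linear(d_model, d_model)  # stand-in for multi-head attention
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
)
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
out = transformer_block(x, attn, ffn, norm1, norm2)
print(out.shape)  # torch.Size([10, 64]): same shape in, same shape out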
These residual connections have a security-relevant property: they create a “residual stream” that carries information from earlier layers directly to later layers. Elhage et al. (2021) from Anthropic described this as a “communication channel” that different layers can read from and write to. Understanding this stream is essential for understanding how information — including injected adversarial content — propagates through the model.
Putting It All Together: A Token’s Journey
Let me trace the complete journey of a single token through a transformer, from input to output:
- Tokenization: “Security” → token ID 14354
- Embedding Lookup: ID 14354 → 12,288-dimensional vector
- Positional Encoding: Add position information to the embedding
- Layer 1 Attention: Attend to all other tokens, aggregate context
- Layer 1 FFN: Transform the attended representation, retrieve relevant knowledge
- Residual + Norm: Stabilize and pass forward
- Layers 2-96: Repeat, building increasingly abstract representations
- Final Layer Norm: Normalize the output
- Unembedding: Project back to vocabulary size (50,257 dimensions)
- Softmax: Convert to probability distribution over next token
By step 10, the model has transformed our input token — through 96 layers of attention and feed-forward operations — into a probability distribution over what comes next. The entire process is a cascade of tensor operations: matrix multiplications, element-wise operations, and softmax normalizations.
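Steps 9 and 10 are worth seeing in code. A sketch with a smaller model dimension so the unembedding matrix fits comfortably in memory (in many models, including GPT-2, this matrix is simply the transpose of the embedding matrix, a detail known as weight tying):
import torch
import torch.nn.functional as F
vocab_size, d_model = 50257, 768
final_hidden = torch.randn(d_model)           # the last layer's output for our token
W_unembed = torch.randn(d_model, vocab_size)  # projects back to vocabulary space
logits = final_hidden @ W_unembed             # step 9: one score per vocabulary token
probs = F.softmax(logits, dim=-1)             # step 10: a distribution over next tokens
print(probs.shape, f"{probs.sum().item():.2f}")  # torch.Size([50257]) 1.00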
The Security Engineer’s Takeaway
After working through this, I want you to walk away with three key insights:
First, the transformer architecture has no built-in security model. There is no concept of “trusted input” vs. “untrusted input.” The attention mechanism treats every token in the context window as a potential source of relevant information. This is a fundamental architectural property, not a bug to be patched.
Second, the model’s knowledge and behavior are distributed across billions of tensor values, but not uniformly. Specific attention heads specialize in specific tasks. Specific FFN rows store specific facts. This structure-within-chaos means that targeted attacks on model weights are not just theoretically possible — they are architecturally plausible.
Third, understanding these mechanics gives security engineers a vocabulary and a framework for reasoning about AI threats. When someone says “prompt injection,” you now know they are describing an attack on the attention mechanism. When someone says “data poisoning,” you know they are describing corrupted training data that bakes malicious behavior into the weight tensors during training.
What’s Coming Next
In Part 3, we will take everything we have learned and turn it toward offense. We will map the attack surface within — examining how each component of the transformer architecture presents unique vulnerabilities. From adversarial embeddings to attention hijacking to weight-level tampering, we will build a taxonomy of AI-specific threats that goes far beyond the “jailbreak prompt” headlines.
The math is the map. Now we start planning the mission.
References
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint arXiv:1607.06450.
- Boucher, N., Shumailov, I., Anderson, R., & Papernot, N. (2022). Bad Characters: Imperceptible NLP Attacks. IEEE Symposium on Security and Privacy (S&P).
- Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT’s Attention. BlackboxNLP Workshop at ACL.
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.
- Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic Research.
- Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS.
- Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques for Language Models. arXiv preprint arXiv:2211.09527.
- Peters, M. E., et al. (2018). Deep contextualized word representations. NAACL-HLT.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL.
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
- Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. ACL System Demonstrations.
- Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL.
Join the Mission
This is just the beginning. I will be sharing my code, data, and research findings as I go. If you are interested in the intersection of AI, Quantum, and Security, I’d love to connect.
- GitHub: github.com/bitghostsecurity
- Collaborate: hello@bitghostsecurity.com
Hardened Logic for an Intelligent Era.