Ch.01
Transformer 1: Self-Attention and Parallelization
The heart of a Transformer model is Self-Attention, together with Add & Norm (a residual connection plus layer normalization) that keeps training stable, and a Feed-Forward network that transforms the gathered information more deeply. Where older models read tokens one by one and often lose earlier context, Transformers process the entire sentence at once, like a top-down overview. In this chapter, you will learn the attention mechanism built from Query, Key, and Value, and the intuitive roles of Add & Norm and Feed Forward that help the model learn deeply and reliably.
Understanding the core formulas
Q = XW^Q, K = XW^K, V = XW^V, where X is the input embeddings and W^Q, W^K, W^V are learnable projection matrices. This step splits the same text into "query-like", "key-like", and "value-like" representations. QK^T is the token-to-token relevance score matrix. Larger scores mean stronger relationships, but when the dimension d_k is large the raw values can grow too much, so we divide by √d_k to stabilize them. A = softmax(QK^T / √d_k) produces a weight matrix whose rows sum to 1, meaning each token decides how much it should attend to others. AV is the final context representation created by mixing values using weights A. The key is that it is a weighted average based on importance, not a plain average.
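The full pipeline above (project to Q/K/V, score, scale, softmax, weighted sum) can be written in a few lines. This is a minimal NumPy sketch for intuition; the function name and the toy shapes are illustrative, not from the text.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Minimal self-attention: project, score, scale, softmax, mix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query/key/value projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # token-to-token relevance, scaled
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)      # row-wise softmax: rows sum to 1
    return A @ V, A                            # weighted average of the values

rng = np.random.default_rng(0)
n, d = 4, 8                                    # 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)          # (4, 8): one updated context vector per token
print(A.sum(axis=-1))     # each row of A sums to 1
```

Note how the output keeps the input shape: each token's vector is rebuilt as a mixture of all value vectors, weighted by its own attention row.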
Self-attention is an operation where each token looks at all tokens and reconstructs context.
Concept structure: Q/K/V → scores → normalization → weighted sum
[Figure: attention-weight visualization legend — weak, medium, and strong reference]
Transformer 1: See Self-Attention at a Glance
Concept Explanation: The Eye for Context
Self-attention lets every token look at all other tokens at the same time, and it assigns weights to decide how much each token should influence the meaning of the current token. For example, in the phrase "I went to the bank", self-attention can infer whether "bank" refers to a financial place or a riverbank by looking at surrounding words at the same time.
Intuition: Query (Q), Key (K), Value (V)
Think of it like searching in a library.
1. Query (Q) is what you type: the question you want to answer.
2. Key (K) is the information on book labels: what each token contains as searchable features.
3. Value (V) is the actual content you get.
Self-attention scores how well Q matches K, then mixes V according to the scores to produce the updated meaning.
Mathematical Explanation: Scaled Dot-Product Attention
Let the input embeddings be a matrix X ∈ R^(n×d). We project them into Q = XW^Q, K = XW^K, and V = XW^V using learnable matrices. Attention scores come from the dot product QK^T. When the dimension d_k is large, dot-product values can become too big, so we divide by √d_k for scaling. After applying softmax, we obtain weights A = softmax(QK^T / √d_k). The final output is AV. Here, d_k is the key dimension, and A is the weight matrix that tells how much each token attends to others.
Real ML Example: Smarter Sentence Understanding
In spam filtering, words like "free" and "click" may be far apart, but self-attention captures their strong relationship at once and helps decide whether a message is spam. In medical document classification, it can connect symptoms, lab results, and negations like "no" in the same step to reduce false diagnoses.
Why it matters
Concept Explanation: Solving Long-Range Dependency Directly
Self-attention is effective because it can capture long-range dependency directly. When a word at the beginning of a sentence influences the meaning at the end, self-attention connects them without losing information in the middle.
Intuition: Relay (RNN) vs. Group Chat (Self-Attention)
RNN-style processing passes information like a relay race, so information can weaken as it moves step by step. Self-attention is like a group chat: everyone sees all messages at the same time, so distant information is available immediately.
Mathematical Explanation: Shorter Information Path
In RNNs, the information path length grows with the token distance (about O(n)), which makes gradient flow harder for long distances. In self-attention, all tokens are connected in a single step, so the path length is about O(1). Short paths make training more stable and help preserve important dependencies.
Real ML Example: Summarizing Long Texts
The advantage is most obvious in tasks like summarizing long legal documents or analyzing long chat logs where key points at the beginning must connect to conclusions at the end.
How it is used
Practical Assembly: Pre-work and a Conveyor Belt
In real systems, you prepare text before feeding it into the model. First, split the sentence into token pieces and convert them to vectors (embeddings). Then add positional encoding so the model knows the order—because attention itself behaves like a "set" operation that ignores order. Positional encoding acts like a tiny tag such as "this is the k-th token". After that, the data goes through a big transformer block: [multi-head attention → Add & Norm → feed-forward → Add & Norm]. Repeating this block 12 times leads to a BERT-Base style model, while repeating 96 times reaches GPT-3 scale.
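Positional encoding can be implemented in several ways; one common choice is the sinusoidal scheme, which tags each position with sine and cosine waves of different frequencies. The text does not specify which encoding it uses, so treat this NumPy sketch as one illustrative option.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims get sine
    pe[:, 1::2] = np.cos(angle)                  # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16); this matrix is added element-wise to the embeddings
```

Because each position gets a unique wave pattern, the otherwise order-blind attention operation can distinguish "the k-th token" from its neighbors.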
Multi-Head Attention: Expert Committee Division
Instead of relying on one perspective, you use an expert committee: multiple heads in parallel. For example, if you have 512-dimensional vectors, split them into 8 heads so each head processes 64-dimensional slices from its own viewpoint. One head may focus on grammatical relations like "who did what", another on sentiment nuances like positive vs. negative cues, and another on named entities such as people and places. Each head i computes attention over its own slice as head_i = softmax(Q_i K_i^T / √d_head) V_i. After that, concatenate the 64-dim outputs from all heads (8 heads) and project back to 512 dimensions with an output matrix W^O to form a rich representation.
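The split-compute-concatenate pattern can be sketched as follows. This NumPy version uses random placeholder weights and, for brevity, derives each head from slices of shared Q/K/V projections (real implementations are equivalent up to how the per-head projections are parameterized).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads                  # 512 / 8 = 64 dims per head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split into heads: (n, d_model) -> (n_heads, n, d_head)
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                 # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # 8 x 64 -> 512
    return concat @ Wo                           # final output projection W^O

rng = np.random.default_rng(1)
n, d_model, n_heads = 6, 512, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 512): same shape as the input
```

Each head sees only its own 64-dimensional slice, so the heads are free to specialize in different relationships before W^O merges their views.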
Feed Forward + Activation: Microscopic Observation and Feature Extraction
Once attention has identified relationships between tokens, the feed-forward network (FFNN) updates each token representation independently and deeply. In practice, it often uses an hourglass structure: expand the dimension (e.g., 512 → 2048) to observe fine-grained patterns, then apply a non-linear activation such as ReLU or GELU, and finally compress back to the original size. As a formula (with ReLU): FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. This helps the model move beyond simple pattern matching and learn complex concepts.
Code and Framework Usage: Hyperparameter Tuning
All this complex math is packaged in frameworks like PyTorch and TensorFlow as a single building block: `nn.TransformerEncoderLayer`. Practitioners usually do not derive everything from scratch; instead they tune key hyperparameters: the embedding size `d_model`, the number of attention heads `nhead`, and the feed-forward expansion size `dim_feedforward`. With these settings aligned to your application and available GPU resources, you can achieve strong performance.
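A minimal PyTorch sketch of this usage, assuming PyTorch is installed; the hyperparameter values are just the BERT-Base-like examples used in this chapter, not recommendations.

```python
import torch
import torch.nn as nn

# One pre-packaged block: multi-head attention + Add & Norm + FFN + Add & Norm.
layer = nn.TransformerEncoderLayer(
    d_model=512,           # embedding size
    nhead=8,               # number of attention heads (512 / 8 = 64 dims per head)
    dim_feedforward=2048,  # FFN expansion size
    batch_first=True,      # input shape (batch, seq_len, d_model)
)
# Stack the block 12 times for a BERT-Base-style encoder.
encoder = nn.TransformerEncoder(layer, num_layers=12)

x = torch.randn(2, 10, 512)   # batch of 2 sentences, 10 tokens each
out = encoder(x)
print(out.shape)  # torch.Size([2, 10, 512]): shape is preserved through the stack
```

Note that `d_model` must be divisible by `nhead`, since each head receives a `d_model / nhead`-dimensional slice.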
Summary
Transformer and self-attention are designed so that "every token looks at every other token at the same time" to understand the whole sentence at once. Instead of reading tokens one-by-one like a relay, self-attention works like an overview: it computes which tokens are most relevant to the current token, then uses those relationships to rebuild context.
The core idea is Attention(Q, K, V) = softmax(QK^T / √d_k)V. A Query represents what we are asking for, a Key captures features of each token, and their dot product produces a relevance score. To prevent scores from exploding, scale by √d_k and apply softmax to get weights A. Finally, mix the Value vectors using those weights to form the final context representation (context vector). This allows far-away words to be connected with a very short path close to O(1).
The relationship information is then expanded by multi-head attention: several heads ("experts") analyze different aspects in parallel, and their results are combined as MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O. After that, Add & Norm (residual connection and normalization) keeps information stable, and Feed Forward (FFNN) transforms each token representation into a deeper, more complex form that forms the base of large models like BERT and GPT.
Tips for solving the problems
Summary — Self-attention lets all tokens reference each other at the same time, so the model can understand context. Using Query (Q), Key (K), and Value (V), it computes the relationship score matrix A = softmax(QK^T / √d_k), enabling long-range dependency with a short path close to O(1). In practice, multi-head attention (multiple attention heads in parallel) is the default way to analyze text from multiple viewpoints.
| Type | Solution / Example (keyword -> answer) |
|---|---|
| Self-attention 3 components | Query, Key, Value matrices. → choose Q, K, V for concept questions |
| Scaled Dot-Product | Scale attention scores by √d_k to stabilize. → prevent exploding values |
| Path length (RNN vs. attention) | RNN is O(n), self-attention is O(1). → solve long-range dependency |
| Multi-Head Attention | Use multiple attention heads in parallel to learn diverse features. → richer representations |
| Meaning of attention matrix | Softmax creates weights whose rows sum to 1. → how strongly to focus |
| How Q, K, V are generated | From input X using learned projections W^Q, W^K, W^V. |
Example (Concept Understanding)
"In self-attention, what does the dot product of Query (Q) and Key (K) (QK^T) represent?
① token length
② relation strength (similarity) between tokens
③ position information"
The dot product score reflects how related two tokens are in context. → Answer 2
Example (O/X)
"After softmax, the sum of attention weights for one token is usually 1. True or False?"
Softmax turns scores into probabilities, so they sum to 1. → Answer O (True)
Example (Scenario-based)
"In a long customer support chat, what model component is most suitable when early negations flip the meaning of later sentences?
① Self-attention
② only average pooling
③ a simple rule system"
You need to directly reference the early negation and connect it to later text, so self-attention is appropriate. → Answer 1
Example (Voting / Count)
"If 5 heads vote as [1,1,0,1,0], how many 1s are there?"
1 + 1 + 0 + 1 + 0 = 3. → Answer 3
Example (Prediction Aggregation)
"If the three heads' class-1 prediction counts are [2,1,2], what is the total?"
2 + 1 + 2 = 5. → Answer 5
Example (Model Dimension Config)
"If there are 8 heads and each head has dimension 8, what is the model dimension d_model?"
8 × 8 = 64. → Answer 64
Example (Ensemble Principle)
"What is the key benefit of combining multi-head results?
① different relationships are learned from multiple perspectives, improving generalization
② parameters become 0
③ computation always becomes 0"
Combining multiple perspectives reduces errors and improves generalization. → Answer 1