Ch.01

Transformer 1: Self-Attention and Parallelization

The heart of a Transformer model is Self-Attention, Add & Norm (residual connection and layer normalization) that keeps training stable, and a Feed Forward neural network that transforms the gathered information deeply. Whereas older models read tokens one by one and often lose earlier context, Transformers process the entire sentence at once, like a top-down overview. In this chapter, you will learn the attention mechanism with Query, Key, and Value, and the intuitive roles of Add & Norm and Feed Forward that help the model learn deeply and reliably.

Understanding the core formulas

Q = X·W_Q, K = X·W_K, V = X·W_V, where X is the input embeddings and W_Q, W_K, W_V are learnable projection matrices. This step splits the same text into "query-like", "key-like", and "value-like" representations.
S = QK^T is the token-to-token relevance score matrix. Larger scores mean stronger relationships, but when the dimension is large the raw values can grow too much, so we divide by √d_k to stabilize them.
A = softmax(S/√d_k) produces a weight matrix whose rows sum to 1, meaning each token decides how much it should attend to others.
O = A·V is the final context representation created by mixing values using the weights A. The key is that it is a weighted average based on importance, not a plain average.
Self-attention is an operation where each token looks at all tokens and reconstructs context.
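The four formula steps can be sketched in a few lines of NumPy. The sizes below are toy values chosen for illustration, and the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 4 tokens, embedding dim 8, key dim d_k = 8 (illustrative only)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))   # input embeddings
W_Q = rng.normal(size=(d_model, d_k))      # learnable projections (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # Q = X·W_Q, K = X·W_K, V = X·W_V
S = Q @ K.T                                # token-to-token relevance scores
S_scaled = S / np.sqrt(d_k)                # divide by sqrt(d_k) to stabilize
A = np.exp(S_scaled - S_scaled.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)       # row-wise softmax: rows sum to 1
O = A @ V                                  # weighted sum of values

print(A.sum(axis=1))   # each row sums to 1
print(O.shape)         # (4, 8): one context vector per token
```

Note that O has the same number of rows as the input: each token gets its own context vector, rebuilt as an importance-weighted mix of all value vectors.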

Concept structure: Q/K/V → scores → normalization → weighted sum

[Figure] Input tokens ("I", "love", "deep", "learning") are related to each other via self-attention, then pass through a Transformer encoder block (Multi-Head Attention → Add & Norm → Feed Forward). The context output is a context vector summarizing the attended information; line thickness in the diagram marks weak, medium, or strong attention between tokens.

Transformer 1: See Self-Attention at a Glance

Concept Explanation: The Eye for Context
Self-attention lets every token look at all other tokens at the same time, and it assigns weights to decide how much each token should influence the meaning of the current token. For example, in the phrase "I went to the bank", self-attention can infer whether "bank" refers to a financial place or a riverbank by looking at surrounding words at the same time.
Intuition: Query (Q), Key (K), Value (V)
Think of it like searching in a library.
1. Query (Q) is what you type: the question you want to answer.
2. Key (K) is the information on book labels: what each token contains as searchable features.
3. Value (V) is the actual content you get.
Self-attention scores how well Q matches K, then mixes V according to the scores to produce the updated meaning.
Mathematical Explanation: Scaled Dot-Product Attention
Let the input embeddings be a matrix X. We project them into Q = X·W_Q, K = X·W_K, and V = X·W_V using learnable matrices. Attention scores come from the dot product QK^T. When the dimension is large, dot-product values can become too big, so we divide by √d_k for scaling. After applying softmax, we obtain weights A = softmax(QK^T/√d_k). The final output is A·V. Here, d_k is the key dimension, and A is the weight matrix that tells how much each token attends to others.
Real ML Example: Smarter Sentence Understanding
In spam filtering, words like "free" and "click" may be far apart, but self-attention captures their strong relationship at once and helps decide whether a message is spam. In medical document classification, it can connect symptoms, lab results, and negations like "no" in the same step to reduce false diagnoses.

Why it matters

Concept Explanation: Solving Long-Range Dependency Directly
Self-attention is effective because it can capture long-range dependency directly. When a word at the beginning of a sentence influences the meaning at the end, self-attention connects them without losing information in the middle.
Intuition: Relay (RNN) vs. Group Chat (Self-Attention)
RNN-style processing passes information like a relay race, so information can weaken as it moves step by step. Self-attention is like a group chat: everyone sees all messages at the same time, so distant information is available immediately.
Mathematical Explanation: Shorter Information Path
In RNNs, the information path length grows with the token distance n (about O(n)), which makes gradient flow harder over long distances. In self-attention, all tokens are connected in a single step, so the path length is about O(1). Short paths make training more stable and help preserve important dependencies.
Real ML Example: Summarizing Long Texts
The advantage is most obvious in tasks like summarizing long legal documents or analyzing long chat logs where key points at the beginning must connect to conclusions at the end.

How it is used

Practical Assembly: Pre-work and a Conveyor Belt
In real systems, you prepare text before feeding it into the model. First, split the sentence into token pieces and convert them to vectors (embeddings). Then add positional encoding so the model knows the order—because attention itself behaves like a "set" operation that ignores order. Positional encoding acts like a tiny tag such as "this is the k-th token". After that, the data goes through a big transformer block: [multi-head attention → Add & Norm → feed-forward → Add & Norm]. Repeating this block 12 times leads to a BERT-Base style model, while repeating 96 times reaches GPT-3 scale.
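The "tiny tag" idea can be made concrete with the sinusoidal positional encoding from the original Transformer design, one common scheme among several. Here it is added onto toy embeddings (sizes are illustrative):

```python
import numpy as np

def positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: each row is the tag for
    'this is the k-th token' that gets added to the embedding."""
    pos = np.arange(n_positions)[:, None]                # (n, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dims: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dims: cosine
    return pe

# Toy pipeline step: embeddings + position tags, ready for the encoder stack
tokens = np.random.default_rng(0).normal(size=(10, 512))  # stand-in embeddings
x = tokens + positional_encoding(10, 512)
print(x.shape)   # (10, 512)
```

Because attention itself ignores order, this addition is what lets the model distinguish "dog bites man" from "man bites dog".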
Multi-Head Attention: Expert Committee Division
Instead of relying on one perspective, you use an expert committee: multiple heads in parallel. For example, if you have 512-dimensional vectors, split them into 8 heads so each head processes 64-dimensional slices from its own viewpoint. One head may focus on grammatical relations like "who did what", another on sentiment nuances like positive vs. negative cues, and another on named entities such as people and places. Each head computes attention as head_h = softmax(Q_h·K_h^T/√d_k)·V_h. After that, concatenate the 64-dim outputs from all 8 heads and project back to 512 dimensions to form a rich representation.
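A shape-level sketch of the split-and-concatenate flow, assuming the 512 → 8 × 64 setup above. The per-head projections and the final output projection W_O are left out for brevity, so this shows the data movement rather than a trained layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads        # 512 / 8 = 64 dims per head

X = rng.normal(size=(n_tokens, d_model))

# Split into heads: (tokens, 512) -> (heads, tokens, 64)
heads_in = X.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each head attends over its own 64-dim slice
# (identity projections here: a simplification, not the real Q/K/V weights)
head_outputs = []
for h in range(n_heads):
    Qh = Kh = Vh = heads_in[h]
    Ah = softmax(Qh @ Kh.T / np.sqrt(d_head))   # (tokens, tokens)
    head_outputs.append(Ah @ Vh)                # (tokens, 64)

# Concatenate the 8 heads back to 512 dims (W_O projection omitted)
out = np.concatenate(head_outputs, axis=-1)
print(out.shape)   # (6, 512)
```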
Feed Forward + Activation: Microscopic Observation and Feature Extraction
Once attention has identified relationships between tokens, the feed-forward network (FFNN) updates each token representation independently and deeply. In practice, it often uses an hourglass structure: expand the dimension (e.g., 512 → 2048) to observe fine-grained patterns, then apply a non-linear activation such as ReLU or GELU, and finally compress back to the original size. As a formula: FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2. This helps the model move beyond simple pattern matching and learn complex concepts.
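A minimal NumPy sketch of the hourglass FFN with ReLU. The weights are random stand-ins for learned parameters, and the 0.02 scale is only there to keep values small:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                     # hourglass: 512 -> 2048 -> 512

W1 = rng.normal(size=(d_model, d_ff)) * 0.02  # expand
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02  # compress back
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each token independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(10, d_model))       # 10 token representations
out = ffn(tokens)
print(out.shape)   # (10, 512): same shape in and out
```

The input and output shapes match, which is what lets the residual connection in Add & Norm simply add the FFN output back onto its input.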
Code and Framework Usage: Hyperparameter Tuning
All this complex math is packaged in frameworks like PyTorch and TensorFlow as a single building block: `nn.TransformerEncoderLayer`. Practitioners usually do not derive everything from scratch; instead they tune key hyperparameters. You choose the embedding size d_model, the number of attention heads nhead, and the feed-forward expansion size dim_feedforward. With these settings aligned to your application and available GPU resources, you can achieve strong performance.
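Assuming PyTorch is installed, the block can be instantiated directly with those three knobs. The values below are illustrative, not a recommendation:

```python
import torch
import torch.nn as nn

# One encoder block with the three key hyperparameters (illustrative values)
layer = nn.TransformerEncoderLayer(
    d_model=512,            # embedding size d_model
    nhead=8,                # number of attention heads
    dim_feedforward=2048,   # FFN expansion size
    batch_first=True,       # inputs shaped (batch, tokens, d_model)
)
encoder = nn.TransformerEncoder(layer, num_layers=12)  # 12 blocks: BERT-Base depth

x = torch.randn(2, 10, 512)   # batch of 2 sequences, 10 tokens each
out = encoder(x)
print(out.shape)              # torch.Size([2, 10, 512])
```

As with the FFN, the output shape matches the input shape, which is what makes it possible to stack the same block 12 or 96 times.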

Summary

Transformer and self-attention are designed so that "every token looks at every other token at the same time" to understand the whole sentence at once. Instead of reading tokens one-by-one like a relay, self-attention works like an overview: it computes which tokens are most relevant to the current token, then uses those relationships to rebuild context.
The core idea is Q, K, V. A Query Q represents what we are asking for, a Key K captures features of each token, and their dot product QK^T produces a relevance score. To prevent scores from exploding, scale by √d_k and apply softmax to get weights A = softmax(QK^T/√d_k). Finally, mix the Value vectors V using those weights to form the final context representation (context vector). This allows far-away words to be connected with a very short path, close to O(1).
The relationship information is then expanded by multi-head attention: several heads ("experts") analyze different aspects in parallel, and their results are combined by concatenation (Concat). After that, Add & Norm (residual connection and normalization) keeps information stable, and Feed Forward (FFNN) transforms each token representation into a deeper, more complex form that forms the base of large models like BERT and GPT.

Tips for solving the problems

Summary — Self-attention lets all tokens reference each other at the same time, so the model can understand context. Using Query (Q), Key (K), and Value (V), it computes the relationship weight matrix A = softmax(QK^T/√d_k), enabling long-range dependency with a short path close to O(1). In practice, multi-head attention (multiple attention heads in parallel) is the default way to analyze text from multiple viewpoints.
  • Self-attention 3 components — Query, Key, Value matrices. → choose Q, K, V for concept questions
  • Scaled Dot-Product — scale attention scores by √d_k to stabilize. → prevent exploding values
  • Path length (RNN vs. attention) — RNN is O(n), self-attention is O(1). → solve long-range dependency
  • Multi-Head Attention — use multiple attention heads in parallel to learn diverse features. → richer representations
  • Meaning of attention matrix A — softmax creates weights whose rows sum to 1. → how strongly to focus
  • How Q, K, V are generated — from input X using learned projections W_Q, W_K, W_V.
Example (Concept Understanding)
"In self-attention, what does the dot product of Query (Q) and Key (K), QK^T, represent?
① token length
② relation strength (similarity) between tokens
③ position information"
The dot product score reflects how related two tokens are in context. → Answer 2

Example (O/X)
"After softmax, the sum of attention weights for one token is usually 1. True or False?"
Softmax turns scores into probabilities, so they sum to 1. → Answer: True

Example (Scenario-based)
"In a long customer support chat, what model component is most suitable when early negations flip the meaning of later sentences?
① Self-attention
② only average pooling
③ a simple rule system"
You need to directly reference the early negation and connect it to later text, so self-attention is appropriate. → Answer 1

Example (Voting / Count)
"If 5 heads vote as [1,1,0,1,0], how many 1s are there?"
1 + 1 + 0 + 1 + 0 = 3. → Answer 3

Example (Prediction Aggregation)
"If the three heads' class-1 prediction counts are [2,1,2], what is the total?"
2 + 1 + 2 = 5. → Answer 5

Example (Model Dimension Config)
"If there are 8 heads and each head has dimension 8, what is the model dimension d_model?"
d_model = 8 × 8 = 64. → Answer 64

Example (Ensemble Principle)
"What is the key benefit of combining multi-head results?
① different relationships are learned from multiple perspectives, improving generalization
② parameters become 0
③ computation always becomes 0"
Combining multiple perspectives reduces errors and improves generalization. → Answer 1