Ch.02
Transformer: Positional Encoding and Feed-Forward
Self-attention captures relationships between tokens well, but it does not by itself tell the model which position in the sentence each token occupies. Transformers therefore add positional encoding (PE) to token embeddings so the model knows which word comes where. After a block mixes relations via attention, a feed-forward (FFN) layer deepens each token's representation. This chapter explains sinusoidal PE, how it differs from learned positional embeddings, and the per-token MLP role of the FFN in a beginner-friendly way.
Reading the formulas
In x = e + p_pos, e is the token embedding and p_pos is the vector for position pos. You add content and order (as numbers) to form the model input. Sinusoidal PE uses sin and cos to encode position with multiple frequencies 1/10000^(2i/d_model). Here i is the dimension-pair index and pos is the token index. In FFN(x) = W2·f(W1·x + b1) + b2, f is a nonlinearity, W1 maps d_model → d_ff, and W2 maps d_ff → d_model. Weight sharing—the same FFN at every position—helps generalization and keeps implementation simple.
[Figure] Top: read left to right—each column adds word meaning plus order turned into numbers (PE). Bottom: lanes don't mix; each passes once through the same compute block (same weights, same ops)—papers call this block the FFN. Order inside one block: ① build the input by adding meaning + order (PE); the attention steps in between are omitted in the figure; ② then refine each lane with the same FFN.
1. Concept: why positional encoding?
Self-attention scores all tokens at once; if the inputs are only token embeddings in a bag, first vs last can blur. Positional encoding builds a vector of length d_model for each position pos and adds it to the embeddings.
Intuition: like seat row/column labels in a theater—PE tags each token seat.
Math: let the token embedding be e ∈ R^{d_model} and the position vector be p_pos ∈ R^{d_model}; the input is often x = e + p_pos.
Use: translation, summarization, QA—word order matters, so BERT/GPT always add position.
2. Concept: sinusoidal PE (clock analogy)
Intuition first: picture an analog clock. The second hand moves fast, the minute hand medium, the hour hand slow. The combined directions of the three hands tell you what time it is—similar to tagging which word slot you are in. Because each hand spins at a different speed, it is also easier to see whether two times are close or far apart (relative distance). Sinusoidal PE is the same idea in math: stack several slow and fast waves so every position gets its own number pattern.
One step more technical: the original Transformer fills the position vector with sin repeating on even dimensions and cos on the paired odd dimensions, using several frequency bands so the model can separate nearby vs distant positions.
Formula (reference, not for memorizing): even dimension 2i: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimension 2i+1: PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) with the same exponent. pos is the token index, i is the dimension-pair index, d_model is the embedding width.
Plain-language unpack: think of it as building a numeric fingerprint for each slot pos. The vector has d_model entries; consecutive pairs behave like waves spinning at different speeds (different frequencies). pos = which word position (0, 1, 2, …). i = which frequency band / which dimension pair you are using. The constant 10000 sets the overall scale so the waves are not all extremely fast or slow. Neighboring positions change the pattern a little; distant positions look more different, which helps the model read relative distance along the sentence. Pairing sin and cos is like writing a rotating hand with two coordinates—more stable than a single wave alone (details are optional).
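The formula above fits in a few lines of NumPy. This is a minimal sketch of the classic sinusoidal table (the function name `sinusoidal_pe` is our own, not from any library):

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal PE: sin on even dims 2i, cos on the paired odd dims 2i+1."""
    positions = np.arange(num_positions)[:, None]        # (pos, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) frequency-band index
    angles = positions / (10000 ** (2 * i / d_model))    # one column per frequency band
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_pe(num_positions=50, d_model=16)
print(pe.shape)     # (50, 16): one fingerprint row per slot
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 alternate -> [0. 1. 0. 1.]
```

Each row is the fingerprint for one slot; you would simply add it to the embedding matrix of the same shape.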
Use: long-context encoders; later methods include RoPE.
3. Concept: feed-forward (FFN) — a “deep chat” per token
One line: Attention is where tokens mix with each other; FFN is the next step where each token lane stays separate and the same small network runs once per lane (like the green compute blocks in the figure above).
Analogy: after a group meeting (attention), everyone walks into a booth one by one for a private follow-up (FFN). The vector width is often expanded in the middle (wider hidden) and then compressed back—an hourglass shape.
Why bother? Attention is mostly linear maps + mixing; FFN adds nonlinearity (e.g. ReLU, GELU) so the model can learn curved, complex rules, not only straight-line patterns.
Math (reference): FFN(x) = W2 · max(0, W1·x + b1) + b2. Weights are usually shared across all positions.
Use: NER, sentiment—attention collects context; FFN sharpens each token.
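A minimal NumPy sketch of the expand-then-compress FFN, with toy sizes we chose for illustration. Note the "lanes don't mix" property at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5          # toy sizes; real models use e.g. 768 -> 3072

# One set of weights, applied identically at every position (weight sharing).
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x: np.ndarray) -> np.ndarray:
    """Expand to d_ff, apply ReLU, compress back to d_model (per token)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))    # 5 token lanes
y = ffn(x)
print(y.shape)                             # (5, 8): same shape, each lane refined

# Lanes don't mix: running one token alone gives the same result for that lane.
assert np.allclose(ffn(x[0:1]), y[0:1])
```

Because the same weights run at every position, the FFN behaves like the private "booth" in the analogy: each token goes in alone and comes out transformed the same way.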
4. Concept: flow inside one block — one conveyor step
One line: Each encoder block is like one station on an assembly line: always the same order of steps.
Easy order:
1. Start: add PE to embeddings so each token “knows” its slot.
2. Mix: Attention lets tokens swap context.
3. Stabilize: Add & Norm — add a skip/residual connection so signals don’t vanish, then layer-normalize to keep scales stable.
4. Per-token polish: FFN updates each lane with nonlinearity.
5. Again Add & Norm to finish the station.
Math (reference): z = LayerNorm(x + Attention(x)), then y = LayerNorm(z + FFN(z)). Stack many such blocks to build rich representations.
Use: search, chatbots, codegen—repeat dozens of times.
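The station order above can be sketched end to end. This is a deliberately simplified single-head block under our own toy sizes (no learned scale/shift in the norm, no biases, no dropout), just to show the Mix → Add & Norm → FFN → Add & Norm flow:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, seq_len = 8, 32, 4

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit scale (no learned params here).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention: this is the step where tokens mix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over key positions
    return weights @ v

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2           # per-token ReLU MLP, shared weights

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(seq_len, d_model))         # stands in for embeddings + PE
z = layer_norm(x + attention(x, Wq, Wk, Wv))    # Mix, then Add & Norm
y = layer_norm(z + ffn(z, W1, W2))              # Per-token polish, Add & Norm again
print(y.shape)                                  # (4, 8): same shape in, same shape out
```

Stacking this block N times is exactly the "repeat dozens of times" assembly line.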
Why it matters
Order changes meaning
"I ate rice" vs "rice ate I" differ grammatically. Without PE, models struggle to keep this consistent. Fraud logs also rely on time order.
FFN brings nonlinearity
Attention is largely linear maps plus softmax mixing; FFN expands, applies ReLU/GELU, and learns complex rules—e.g., symptom combinations in clinical text.
Compute trade-offs
Larger d_ff and depth raise quality but also GPU cost and latency—key for production tuning.
Foundation for modern models
Absolute embeddings, sinusoidal PE, RoPE, ALiBi… the theme is encode order in tensors. FFN + attention blocks underpin BERT, GPT, ViT.
How it is used
Pipeline: tokenize → embed → +PE
Tokenize, multiply by an embedding matrix, add position vectors. Libraries expose max_position_embeddings for learned tables. Long-document QA must co-design context length.
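The tokenize → embed → +PE pipeline can be sketched with a toy three-word vocabulary (everything here is hypothetical; real pipelines use a trained tokenizer and learned matrices):

```python
import numpy as np

# Toy vocabulary and sizes, chosen for illustration only.
vocab = {"i": 0, "ate": 1, "rice": 2}
d_model, max_positions = 8, 16
rng = np.random.default_rng(2)
embedding_matrix = rng.normal(size=(len(vocab), d_model))   # learned in practice
position_table = rng.normal(size=(max_positions, d_model))  # learned PE table (cf. max_position_embeddings)

tokens = "i ate rice".split()
ids = [vocab[t] for t in tokens]           # tokenize -> integer ids
x = embedding_matrix[ids]                  # embed: (3, d_model)
x = x + position_table[: len(ids)]         # +PE: add one position vector per slot
print(x.shape)                             # (3, 8)
```

The `position_table` row limit is why context length is a hard design constraint for learned tables: slot 17 has no row here.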
FFN hyperparameters
intermediate_size (d_ff), activation (GELU), dropout. Example: d_ff = 4 × d_model (e.g., 768 → 3072) is common. Code models may widen the FFN for syntax/style.
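The cost side of this knob is easy to make concrete. A quick back-of-the-envelope count for one FFN at the common 4× widening (sizes assumed, matching the example above):

```python
d_model, d_ff = 768, 3072          # the common 4x widening (e.g. 768 -> 3072)

# Parameters in one FFN: W1 (d_model x d_ff) + b1 (d_ff) + W2 (d_ff x d_model) + b2 (d_model)
params = d_model * d_ff + d_ff + d_ff * d_model + d_model
print(params)                      # 4722432 parameters in a single layer's FFN
```

Multiply by the number of layers and the FFN alone dominates much of the parameter budget—this is why widening it is a deliberate trade-off, not a free upgrade.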
Decoder note
Masked attention hides future tokens, but PE still marks left-to-right order for generation quality.
Debugging hints
If order matters, inspect PE/RoPE/context length; if representations are flat, inspect FFN width/depth/activation—common for spam/news tasks.
Summary
Half of why transformers work is attention, but you still need a reliable way to tell the model which slot each token occupies. Sinusoidal PE overlays multiple frequency waves so each position gets a distinct pattern added to embeddings. Later, attention mixes tokens while FFN applies the same nonlinear transformation at every position to refine features. The expand-then-contract FFN is the practical knob between quality and compute—shared across translation, summarization, classification, and generation.
How to approach the exercises
Summary — PE adds order to embeddings; sinusoidal PE stacks frequency bands; FFN applies a shared MLP per position. In practice d_ff, depth, and context length move together with cost.
| Type | Hint (keyword → idea) |
|---|---|
| Why PE | Inject order / absolute vs relative cues → look for "embedding + PE" |
| Sinusoidal PE | Even dims sin, odd dims cos (classic pairing) |
| Additive PE | x = e + p_pos: add the position vector to the embedding |
| FFN | Per-token MLP, often d_ff = 4 × d_model |
| Sharing | Same FFN weights at all positions |
| Trade-off | Wider/deeper ↔ more compute/latency |
Example (concept understanding)
"Self-attention alone always exposes order perfectly."
① True
② Only partly
③ Order doesn’t matter
Not fully—PE (etc.) supplements order. → 2
"In classic sinusoidal PE, which function is used on even dimensions 2i?"
① cosine only
② sine
③ identity
Typically sin(pos / 10000^(2i/d_model)). → 2
"How are the token embedding e and position vector p usually combined?"
① Add
② Concatenate only
③ Elementwise product only
Additive PE is standard. → 1
"Which best describes an FFN block?"
① Build pairwise token scores (attention)
② Per-token nonlinear transform (expand/compress)
③ Only dropout
Per-token MLP. → 2
"In a typical encoder layer, FFN weights across positions are:"
① Different per position
② Shared across positions
③ Only W1 is shared
Parameter sharing is common. → 2
"If you increase d_ff or depth, what usually rises together?"
① Speed always improves
② Compute, memory, latency
③ Number of labels
Capacity vs cost trade-off. → 2
Example (T/F)
"FFN must use different weights per token." True=1, False=0.
Usually shared weights. → 0
Example (scenario)
"Clinical notes: order of pre/post medication matters. What to strengthen first?"
① embedding+PE
② pixels only
③ filename only
Need order signal. → 1
Example (vote count)
"In the indicator vector [1,1,0,1,0], how many 1s?"
1 + 1 + 0 + 1 + 0 = 3. → 3
Example (model prediction aggregate)
"Sum of three block scores [2,1,2]?"
2 + 1 + 2 = 5. → 5
Example (model config / calculation)
"With 10 tokens, how many cells in a self-attention score matrix?"
10 × 10 = 100. → 100
Example (ensemble / depth principle)
"Closest goal of stacking many layers?"
① staged abstraction
② delete data
③ forbid input
Deeper layers build richer representations. → 1