Everyone's AI
Ch.02

Transformer: Positional Encoding and Feed-Forward

Self-attention captures relationships between tokens well, but by itself it does not tell the model which position in the sentence each token occupies. Transformers therefore add positional encoding (PE) to the token embeddings so the model knows which word comes where. After a block mixes relations, a feed-forward (FFN) layer deepens each token's representation. This chapter explains sinusoidal PE, how it differs from learned positional embeddings, and the per-token MLP role of the FFN in a beginner-friendly way.

Reading the formulas

  • In h_t^{(0)} = x_t + PE(t), x_t is the token embedding and PE(t) is the vector for position t. You add content and order (as numbers) to form the model input.
  • Sinusoidal PE uses PE(t, 2i) = sin(t / 10000^{2i/d}) and PE(t, 2i+1) = cos(t / 10000^{2i/d}) to encode position with multiple frequencies i. Here d is d_model and t is the token index.
  • In FFN(h) = W_2 σ(W_1 h + b_1) + b_2, σ is a nonlinearity, W_1 maps d_model → d_ff, and W_2 maps d_ff → d_model.
  • Weight sharing (the same FFN at every position) helps generalization and keeps implementation simple.
[Figure: ① Build the inputs: each token (A, B, C, D) fuses its meaning (embedding) with its order turned into numbers (PE). ② The same FFN compute block (shared W₁, W₂) then polishes each lane once (input → wider layer → nonlinear → output); the four lanes never mix, and the attention step in between is omitted from the figure.]


1. Concept: why positional encoding?
Self-attention scores all tokens at once; with token embeddings alone the input is effectively an unordered bag, so first vs last positions can blur together. Positional encoding builds a vector PE(p) of length d_model for each position p and adds it to the embeddings.
Intuition: like seat row/column labels in a theater—PE tags each token seat.
Math: let the token embedding be x_t ∈ R^{d_model}; a common choice is h_t^{(0)} = x_t + PE(t).
Use: translation, summarization, QA—word order matters, so BERT/GPT always add position.
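The additive recipe h = x + PE(pos) can be sketched in a few lines of NumPy with a learned-style position table; all sizes and random values below are illustrative stand-ins, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, max_len, seq_len = 8, 512, 4

# Token embeddings for a 4-token sentence (random stand-ins for trained vectors).
x = rng.normal(size=(seq_len, d_model))

# A learned positional table would be trained end to end; here it is just initialized.
pos_table = rng.normal(size=(max_len, d_model)) * 0.02

# Additive PE: h_t = x_t + PE(t), a simple row lookup by position index.
h = x + pos_table[np.arange(seq_len)]
print(h.shape)  # (4, 8)
```

Swapping the random table for a fixed sinusoidal one changes only how pos_table is filled; the addition step stays the same.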
2. Concept: sinusoidal PE (clock analogy)
Intuition first: picture an analog clock. The second hand moves fast, the minute hand medium, the hour hand slow. The combined directions of the three hands tell you what time it is—similar to tagging which word slot you are in. Because each hand spins at a different speed, it is also easier to see whether two times are close or far apart (relative distance). Sinusoidal PE is the same idea in math: stack several slow and fast waves so every position gets its own number pattern.
One step more technical: the original Transformer fills the position vector with sin on some dimensions and cos on the paired dimensions, using several frequency bands so the model can separate nearby vs distant positions.
Formula (reference, not for memorizing): even dimension 2i: PE(t, 2i) = sin(t / 10000^{2i/d_model}); odd dimension 2i+1: cos with the same exponent. t is the token index, i is the dimension index, d_model is the embedding width.
Plain-language unpack: think of it as building a numeric fingerprint for each slot t. The vector has d_model entries; consecutive pairs behave like waves spinning at different speeds (different frequencies). t = which word position (0, 1, 2, …). i = which frequency band / which dimension pair you are using. d_model sets the overall scale so the waves are not all extremely fast or slow. Neighboring positions change the pattern a little; distant positions look more different, which helps the model read relative distance along the sentence. Pairing sin and cos is like writing a rotating clock hand with two coordinates, more stable than a single wave alone (details are optional).
Use: long-context encoders; later methods include RoPE.
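The formula above can be written out directly. A small NumPy sketch (the sequence length and width below are arbitrary toy values):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Classic sinusoidal PE: sin on even dims, cos on the paired odd dims."""
    t = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]       # frequency-band index
    angles = t / (10000 ** (2 * i / d_model))  # one column per band
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)               # odd dimensions:  cos
    return pe

pe = sinusoidal_pe(50, 16)
print(pe[0, :4])  # position 0: sin(0)=0 on even dims, cos(0)=1 on odd dims
```

Each pair of columns spins at its own speed, so nearby positions get similar fingerprints while distant ones diverge.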
3. Concept: feed-forward (FFN) — a “deep chat” per token
One line: Attention is where tokens mix with each other; FFN is the next step where each token lane stays separate and the same small network runs once per lane (like the green compute blocks in the figure above).
Analogy: after a group meeting (attention), everyone walks into a booth one by one for a private follow-up (FFN). The vector width d_model is typically expanded in the middle (a wider hidden layer of width d_ff) and then compressed back, an expand-then-contract shape.
Why bother? Attention is mostly linear maps plus mixing; FFN adds nonlinearity (e.g. ReLU, max(0, ·)) so the model can learn curved, complex rules, not only straight-line patterns.
Math (reference): FFN(x) = max(0, x W_1 + b_1) W_2 + b_2. The weights W_1, W_2 are usually shared across all positions.
Use: NER, sentiment—attention collects context; FFN sharpens each token.
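The per-lane behavior is easy to demonstrate in code. A minimal NumPy sketch with toy sizes and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4

# Small random weights stand in for trained parameters (toy sizes).
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

def ffn(h):
    """Position-wise FFN: expand to d_ff, apply ReLU, compress back to d_model."""
    return np.maximum(0.0, h @ W1 + b1) @ W2 + b2

h = rng.normal(size=(seq_len, d_model))  # four token lanes
out = ffn(h)
print(out.shape)  # (4, 8)

# Lanes never mix: perturbing token 0 leaves the other lanes' outputs unchanged.
h2 = h.copy()
h2[0] += 1.0
assert np.allclose(ffn(h2)[1:], out[1:])
```

The same W1 and W2 serve every lane, which is exactly the weight sharing described above.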
4. Concept: flow inside one block — one conveyor step
One line: Each encoder block is like one station on an assembly line: always the same order of steps.
Easy order:
1. Start: add PE to embeddings so each token “knows” its slot.
2. Mix: Attention lets tokens swap context.
3. Stabilize: Add & Norm — add a skip/residual so signals don’t vanish, then layer-normalize scales.
4. Per-token polish: FFN updates each lane with nonlinearity.
5. Again Add & Norm to finish the station.
Math (reference): h' = LayerNorm(h + Attn(h)), then h'' = LayerNorm(h' + FFN(h')). Stack many such blocks to build rich representations.
Use: search, chatbots, codegen—repeat dozens of times.
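The five steps above can be strung together in a short sketch. Attention and FFN are passed in as zero-output stand-ins here, since the point is the residual Add & Norm flow (LayerNorm is simplified, without learned gain/bias):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each token vector to zero mean, unit scale (simplified)."""
    mu = h.mean(-1, keepdims=True)
    sd = h.std(-1, keepdims=True)
    return (h - mu) / (sd + eps)

def encoder_block(h, attn, ffn):
    """One station: mix (attention), Add & Norm, per-token FFN, Add & Norm."""
    h = layer_norm(h + attn(h))  # steps 2-3: swap context, then stabilize
    h = layer_norm(h + ffn(h))   # steps 4-5: polish each lane, then stabilize
    return h

# Stand-ins so the flow runs end to end; real attention/FFN modules go here.
h = np.random.default_rng(0).normal(size=(4, 8))
out = encoder_block(h, attn=lambda z: np.zeros_like(z), ffn=lambda z: np.zeros_like(z))
print(out.shape)  # (4, 8)
```

Stacking the blocks is then just calling encoder_block repeatedly on its own output.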

Why it matters

Order changes meaning
"I ate rice" vs "rice ate I" mean entirely different things. Without PE, models struggle to keep word order consistent. Fraud-detection logs likewise depend on time order.
FFN brings nonlinearity
Attention is largely linear maps plus softmax mixing; FFN expands, applies ReLU/GELU, and learns complex rules—e.g., symptom combinations in clinical text.
Compute trade-offs
Larger d_ff and greater depth raise quality but also GPU cost and latency, a key consideration for production tuning.
Foundation for modern models
Absolute embeddings, sinusoidal PE, RoPE, ALiBi… the theme is encode order in tensors. FFN + attention blocks underpin BERT, GPT, ViT.

How it is used

Pipeline: tokenize → embed → +PE
Tokenize, multiply by an embedding matrix, add position vectors. Libraries expose max_position_embeddings for learned tables. For long-document QA, the context length must be designed together with the PE scheme.
FFN hyperparameters
intermediate_size (d_ff), activation (GELU), dropout. Example: d_model = 768 → d_ff = 3072 is common. Code models may widen the FFN for syntax/style.
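As a quick sanity check on that example configuration, the FFN's parameter count follows directly from the two weight matrices and their biases:

```python
# FFN parameter count for the common d_model=768, d_ff=3072 setting.
d_model, d_ff = 768, 3072
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model  # W1 + b1 + W2 + b2
print(f"{ffn_params:,}")  # 4,722,432 parameters, about 4.7M per layer
```

Multiply by the number of layers and the FFN alone accounts for a large share of the model's parameters, which is why d_ff is a primary compute knob.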
Decoder note
Masked attention hides future tokens, but PE still marks left-to-right order for generation quality.
Debugging hints
If order matters, inspect PE/RoPE/context length; if representations are flat, inspect FFN width/depth/activation—common for spam/news tasks.

Summary

Half of why transformers work is attention, but you still need a reliable way to tell the model which slot each token occupies. Sinusoidal PE overlays multiple frequency waves so each position gets a distinct pattern added to embeddings. Later, attention mixes tokens while FFN applies the same nonlinear transformation at every position to refine features. The expand-then-contract FFN is the practical knob between quality and compute—shared across translation, summarization, classification, and generation.

How to approach the exercises

Summary: PE adds order to embeddings; sinusoidal PE stacks sin/cos bands; FFN applies a shared MLP per position. In practice d_ff, depth, and context length move together with cost.
Type: Hint (keyword → idea)
  • Why PE: Inject order / absolute vs relative cues → look for "embedding + PE"
  • Sinusoidal PE: Even dims sin, odd dims cos (classic pairing)
  • Additive PE: h = x + PE(pos)
  • FFN: Per-token MLP, often d_ff > d_model
  • Sharing: Same FFN weights at all positions
  • Trade-off: Wider/deeper ↔ more compute/latency
Example (concept understanding)
"Self-attention alone always captures token order perfectly."
① True
② Only partly
③ Order doesn’t matter
Not fully—PE (etc.) supplements order. → 2

"In classic sinusoidal PE, which function is used on even dimensions 2i?"
① cosine only
② sine
③ identity
Typically PE(t, 2i) = sin(⋯). → 2

"How are token embedding x and position vector PE usually combined?"
① Add x + PE(pos)
② Concatenate only
③ Elementwise product only
Additive PE is standard. → 1

"Which best describes an FFN block?"
① Build pairwise token scores (attention)
② Per-token nonlinear transform (expand/compress)
③ Only dropout
Per-token MLP. → 2

"In a typical encoder layer, FFN weights across positions are:"
① Different W_1, W_2 per position
② Shared across positions
③ Only PE is shared
Parameter sharing is common. → 2

"If you increase d_ff or depth, what usually rises together?"
① Speed always improves
② Compute, memory, latency
③ Number of labels
Capacity vs cost trade-off. → 2

Example (T/F)
"FFN must use different weights per token." True=1, False=0.
Usually shared weights. → 0

Example (scenario)
"Clinical notes: order of pre/post medication matters. What to strengthen first?"
① embedding+PE
② pixels only
③ filename only
Need order signal. → 1

Example (vote count)
"In the indicator vector [1,1,0,1,0], how many 1s?"
1 + 1 + 0 + 1 + 0 = 3. → 3

Example (model prediction aggregate)
"Sum of three block scores [2,1,2]?"
2 + 1 + 2 = 5. → 5

Example (model config / calculation)
"With 10 tokens, how many cells in a self-attention score matrix?"
10 × 10 = 100. → 100

Example (ensemble / depth principle)
"Closest goal of stacking many layers?"
① staged abstraction
② delete data
③ forbid input
Deeper layers build richer representations. → 1