Transformer Lineage: Encoder (BERT) vs Decoder (GPT)
The Transformer evolved into two great lineages. BERT, from the encoder clan (understanding models), reads a whole sentence at a glance; GPT, from the decoder clan (generation models), keeps inventing the next token from what came before. If BERT is the ace of 'college-entrance cloze reading,' GPT is the prodigy of 'word chains and novel writing.' This chapter explains how the two models learn, and why their roles in industry differ completely—using analogies beginners can grasp.
Reading the formulas
h_t^(0) = x_t + PE(t) is the starting token representation. x_t carries word meaning and PE(t) carries position, so the model gets both "what" and "where" from the first layer.
Attn(Q, K, V) = softmax(QKᵀ / √d_k) V is the core self-attention operation. It computes relevance with QKᵀ, stabilizes scale with √d_k, then mixes V using softmax weights to build context-aware representations.
L_MLM = −Σ_{t∈M} log p(w_t | X) is BERT's MLM objective. The model predicts the ground-truth masked tokens w_t at positions M, learning to use bidirectional context.
L_LM = −Σ_t log p(x_t | x_{<t}) is GPT's next-token objective. Each step conditions only on past tokens x_{<t}, enabling autoregressive generation; causal masking prevents future leakage.
BERT builds representations by attending across the whole sentence; GPT appends the next token using only the tokens so far.
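The attention formula above can be sketched in a few lines of numpy. This is a minimal toy illustration of scaled dot-product attention (single head, no learned projections), not a production implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # relevance between positions
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions -> ~0 weight
    return softmax(scores) @ V

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per token
```

With `mask=None` this is BERT-style bidirectional attention; passing a lower-triangular mask turns the same function into GPT-style causal attention.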
Concept structure: encoder (understand) vs decoder (generate)
[Figure, left panel: BERT — bidirectional attention over the example sentence "I love deep learning"; each token attends to all tokens.]
The left mini visual simplifies Multi-Head Attention. Multiple heads attend to different relations in parallel, then combine them (concat + projection) into a bidirectional context representation.
[Figure, right panel: GPT — causal attention over the same sentence; the current token attends only to left/past tokens, and future positions are blocked.]
The right mini visual shows Masked Multi-Head Attention. It keeps the same multi-head structure but applies a causal mask so the current position cannot see future tokens, enabling autoregressive next-token learning.
1. BERT: bidirectional reading for “understanding” (encoder)
Concept: BERT (Bidirectional Encoder Representations from Transformers) grows out of the Transformer encoder alone. The core is bidirectional context: left and right words are used together to build the most faithful representation of what the current word means.
Intuition: like a master clinician who lays out past history (left) and today’s tests (right) at once and decides holistically—seeing the whole picture makes context understanding strong.
Math: BERT’s flagship training is MLM (Masked Language Modeling): punch a hole (`[MASK]`) in the sentence and train the distribution p(w_t | full context) to recover the correct token w_t.
ML use case: text classification (“positive or negative review?”), named-entity recognition (“find names and dates”), document search, and more.
2. GPT: endlessly “generating” the next token (decoder)
Concept: GPT (Generative Pre-trained Transformer) develops the Transformer decoder. The model is not allowed to see the full sentence at once: a mask hides future words so that only past tokens (1…t−1) are used to predict the next token t—autoregressive behavior.
Intuition: like a novelist at a typewriter—you cannot see the next sentence in advance; you imagine the next word from what you have already written.
Math: to stop future information leaking, causal masking sets the upper triangle of the attention matrix to −∞. Training minimizes the loss −log p(x_t | x_{<t}) for the next token x_t given its prefix x_{<t} (equivalently, it maximizes the log-likelihood).
ML use case: chat replies, email drafts, code completion—anything that creates new text.
3. Different training goals: cloze vs word-chain
Concept: different skeletons mean different training. BERT drills representations by erasing a span and guessing from neighbors. GPT drills generation by only looking left and continuing the sentence.
Intuition: BERT’s camp hands out “Yesterday I ate at [MASK]” and asks whether “restaurant” fits. GPT’s camp shows only “Yesterday I ate at a restaurant…” and makes you keep inventing what comes next.
4. Inference: instant scan vs streaming generation
Concept: UX differs in production. BERT, given a full sentence, can compute meaning vectors in one go. GPT takes a prompt, emits one token, feeds it back, and repeats—streaming text out token by token.
Intuition: BERT is like a scanner that reads the page once and outputs a label. GPT is like a live typist or simultaneous interpreter, adding one token at a time—so latency grows with output length.
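The token-by-token loop is worth seeing in code. Below, a hypothetical bigram lookup table stands in for the model (a real GPT conditions on the full prefix, not just the last token); the point is the feed-the-output-back-in loop:

```python
# Toy autoregressive loop: a stand-in "model" maps the last token to the next.
next_token = {"yesterday": "i", "i": "ate", "ate": "at", "at": "a",
              "a": "restaurant", "restaurant": "<eos>"}

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):      # one model call per new token
        tok = next_token.get(tokens[-1], "<eos>")
        if tok == "<eos>":               # stop token ends generation early
            break
        tokens.append(tok)               # feed the output back in as input
    return tokens

print(generate(["yesterday"]))
# ['yesterday', 'i', 'ate', 'at', 'a', 'restaurant']
```

Each appended token requires another pass through the model, which is why GPT latency grows with output length while BERT's one-shot scan does not.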
Why it matters
Right tool: know each model’s specialty or pay the price
A common mistake is “GPT is hot—use it for everything.” Using a huge GPT just to classify reviews as positive/negative is overkill. Asking BERT to write poetry gives nonsense. Knowing the lineage helps you balance cost and performance in architecture.
Masking rules shape safety and personality
GPT’s causal mask enforces the rules of generation. For stock paths or fraud logs, temporal order matters—blocking future peeking prevents data leakage. When you already have the full past and need diagnosis, bidirectional BERT is often stronger.
Different pipeline entrances (classifier vs prompt)
Fine-tuning differs. BERT usually adds a small classifier head on top. GPT more often uses prompt engineering or instruction tuning in dialogue form instead of reshaping the backbone.
Hallucination: a design anchor
Generative models can sound confident when wrong. Production often pairs an encoder-style retriever on internal docs with a GPT-style generator—RAG—to combine both lineages’ strengths.
How it is used
Production (BERT): embeddings, then a head
BERT compresses text into embedding vectors. For spam, feed the email, take the pooled `[CLS]` vector, pass it through a small logistic or linear head for a 0–1 score—fast and accurate for high-volume backend classification.
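The "pooled vector plus small head" pattern can be sketched without any model weights. The `[CLS]` vector below is random stand-in data (a real deployment would take it from a BERT forward pass), and the head is an untrained logistic layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Pretend pooled [CLS] embedding for one email (BERT-base hidden size is 768).
cls_vec = rng.normal(size=768)

# Small logistic head: one weight vector + bias, output squashed into [0, 1].
w = rng.normal(size=768) * 0.01
b = 0.0
spam_score = sigmoid(cls_vec @ w + b)
print(0.0 <= spam_score <= 1.0)  # True: a single spam probability per email
```

In practice only `w` and `b` (or a slightly larger head) are trained on labeled spam data, which is why this setup is cheap to fine-tune and fast to serve.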
Production (GPT): chain generation with reins
Serving GPT needs reins so answers don’t drift. Practitioners tune temperature (creativity vs precision) and top-k / top-p (limit to likely tokens). Coding assistants use low temperature; marketing copy may use higher temperature.
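Temperature and top-k are simple transforms on the model's logits. A minimal numpy sketch (top-p is omitted for brevity; the logits here are invented toy values):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from logits with temperature and optional top-k."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    if top_k is not None:                 # keep only the k highest-scoring tokens
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())               # softmax over the surviving tokens
    p /= p.sum()
    return rng.choice(len(p), p=p)

logits = [2.0, 1.0, 0.2, -1.0]
rng = np.random.default_rng(0)
# Low temperature sharpens the distribution toward the top token.
ids = [sample(logits, temperature=0.1, top_k=2, rng=rng) for _ in range(20)]
print(set(ids) <= {0, 1})  # True: only the two most likely tokens can appear
```

Raising the temperature flattens `p` and makes rarer tokens more likely, which is the "creativity vs precision" dial mentioned above.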
Cost and GPU compute
BERT-scale models (hundreds of MB to a few GB) often run on one cheap GPU or even CPU. Large GPT-style LLMs can need tens–hundreds of GB; cost and latency scale brutally with output length—budget accordingly.
Debugging: what broke?
Weak BERT classifiers: noisy labels, bad annotation, or insufficient capacity. Chatty GPT going off-script: vague prompts, hallucination, or too little context. Diagnose along those lines.
Summary
The Transformer branched into two great houses: BERT-style encoders that read a sentence bidirectionally and build rich representations, and GPT-style decoders that generate the next token from left context only—like cloze tests vs word chains. Training differs: BERT often uses MLM to fill masked spans; GPT maximizes next-token likelihood under a causal mask. In deployment, BERT often scores or embeds a fixed input in one pass, while GPT extends a prompt token by token—so latency grows with output length. Modern products mix both: retrieval with encoders, drafting with decoders, and RAG to curb hallucination.
How to approach the exercises
Summary — BERT (encoder clan) centers on bidirectional understanding and representations; GPT (decoder clan) centers on autoregressive generation with only left context. BERT learns contextual embeddings via MLM; GPT trains p(x_t | x_{<t}). Inference: BERT is closer to one-shot vector readout; GPT is closer to streaming token-by-token growth.
Type — Hint (keyword → idea)
BERT family — Encoder-only, bidirectional context, representation learning → look for "encoder / understanding"
GPT family — Decoder, causal mask, next-token prediction → look for "generation / autoregressive"
MLM — Mask some tokens and reconstruct; loss on masked positions → training objective
Causal LM — −Σ_t log p(x_t | x_{<t}), left-only conditioning → generative training
Masking — Block future positions (e.g., −∞ in attention) → prevent peeking