Ch.05
Vision Transformer (ViT) and Image Patches
Split the image into a patch grid. Each patch is flattened into a number vector (length $P^2 C$). A summary slot (CLS) and where-each-patch-sat info are added, then the encoder and classifier run.
Patchify → Linear embed → Token row → Encoder → Classify
Learning flow at a glance
- ① Patchify: tile the image.
- ② Tokenize: embed each patch and add positions.
- ③ Encoder: repeat MHA + FFN blocks.
- ④ Classify: read CLS (or pooling) with a head.
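Steps ① and ② above can be sketched in a few lines of NumPy. This is an illustrative shape-check only, not a real model: the embedding matrix, positions, and sizes (224×224 image, patch 16, model dimension 768, ViT-Base-like) are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224          # image size (assumed, ViT-Base/16 style)
C = 3                # channels
P = 16               # patch size
d = 768              # model dimension (assumed)

image = rng.standard_normal((H, W, C))

# ① Patchify: tile the image into (H/P)*(W/P) non-overlapping patches.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (196, 768): one flattened row per patch

# ② Tokenize: linear embedding + CLS slot + positions (random stand-ins here).
E = rng.standard_normal((P * P * C, d)) * 0.02    # patch embedding matrix
pos = rng.standard_normal((patches.shape[0] + 1, d)) * 0.02
cls = np.zeros((1, d))                            # summary slot (CLS)

tokens = np.concatenate([cls, patches @ E], axis=0) + pos
print(tokens.shape)                               # (197, 768): CLS + 196 patch tokens
```

Steps ③ and ④ then run this `(197, 768)` token row through a standard encoder and read the CLS row, exactly as in Chapters 01–03.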
In Chapters 01–03 you learned the Transformer encoder: a machine for token sequences. Vision Transformer (ViT) brings the same idea to images. The recipe is simple: cut a photo into small patches, turn each patch into a token vector, add positional information, and run a standard encoder—just like BERT, but the “words” are patches.
Intuition: A CNN slides local filters to build features; ViT “writes the image as a sentence” so patches can look at each other with self-attention.
Math in one line: Flatten each patch to $x_p \in \mathbb{R}^{P^2 C}$, linearly map with $E \in \mathbb{R}^{d \times P^2 C}$, add position embeddings $e_p$, stack $L$ encoder blocks. Classification often uses a [CLS] token readout + linear head.
Practice: ViT/Swin/ConvNeXt-style backbones appear in medical/industrial inspection, satellite imagery, scanned documents, and multimodal models. Chapter 04’s long-sequence cost ($\sim O(n^2)$) returns when resolution grows—pair ViT with sensible patch sizes, efficient attention, or hybrid designs.
Formulas (ViT flow)
Patch vector & linear embedding
Flatten a patch to $x_p \in \mathbb{R}^{P^2 C}$ and apply a linear map $z_p = E x_p$ with $E \in \mathbb{R}^{d \times P^2 C}$, reaching model dimension $d$.
$z_p$ is the patch token carrying local appearance in $d$ dimensions.
Add positions
Add positional information to token $p$: $z_p \leftarrow z_p + e_p$.
So attention knows *where* each patch lived spatially.
Self-attention block
With token matrix $Z \in \mathbb{R}^{n \times d}$, compute $Q = ZW_Q$, $K = ZW_K$, $V = ZW_V$ and $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right)V$.
Same as Chapter 01—only tokens are patches, not words.
CLS & classifier head
Take the first token $z_{\mathrm{CLS}}$ and apply a linear classifier: $\hat{y} = \mathrm{softmax}(W z_{\mathrm{CLS}} + b)$.
CLS learns to summarize the whole image for the task.
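The four formulas above fit in one tiny NumPy forward pass. This is a single-head, single-block sketch with toy sizes and random weights, so the "logits" are meaningless — it only demonstrates the shapes and the flow patch tokens → attention → CLS readout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, num_classes = 5, 8, 8, 3       # CLS + 4 patch tokens (toy sizes)

Z = rng.standard_normal((n, d))           # token matrix: CLS first, then patches
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Self-attention block: Q = Z Wq, K = Z Wk, V = Z Wv; softmax(Q K^T / sqrt(d_k)) V
Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V    # (n, d_k)

# CLS readout + linear classifier head
W_head = rng.standard_normal((d_k, num_classes))
logits = attn[0] @ W_head                     # first token = CLS
print(logits.shape)                           # (3,)
```

A real ViT block adds multi-head splitting, residual connections, LayerNorm, and an FFN around this core, but the attention arithmetic is exactly Chapter 01's.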
Vision Transformer: Turning Images into Patch Tokens
1. Why patches? Turn pixels into tokens
Concept: Transformers consume ordered tokens $x_1, \dots, x_n$. ViT tiles the image into $P \times P$ patches; each patch (with $C$ channels) is flattened and linearly projected to dimension $d$.
Intuition: Like numbered puzzle pieces that talk to each other to infer the whole scene.
Practice: Larger patches → fewer tokens but less detail; smaller patches → more tokens → Chapter 04-style cost rises.
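The patch-size tradeoff is easy to see with numbers. For a fixed 224×224 image (an assumed, typical size), halving the patch side quadruples the token count and ×16s the dense attention matrix:

```python
# Token count vs. patch size for a 224x224 image (square tiling assumed).
for P in (8, 16, 32):
    n = (224 // P) ** 2
    print(f"patch {P:2d}: {n:4d} tokens, attention matrix {n * n:,} entries")
# patch  8:  784 tokens, attention matrix 614,656 entries
# patch 16:  196 tokens, attention matrix 38,416 entries
# patch 32:   49 tokens, attention matrix 2,401 entries
```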
2. Patch embedding, CLS, positions
Concept: Add learned positional embeddings so the model knows *where* each patch came from. Image classification often prepends a [CLS] token and reads logits from its final hidden state.
Math habit: Let $z_0$ be CLS and $z_1, \dots, z_N$ the patch tokens—same picture as BERT.
Practice: Detection/segmentation variants may tokenize differently (queries, pixels).
3. Encoder vs CNN
Concept: A ViT backbone models global patch interactions with self-attention instead of stacking only local convolutions. Hybrid models add a small conv stem.
Intuition: CNNs repeat local windows; dense ViT attention is closer to an “everyone-in-the-room meeting” (for standard ViT).
Practice: With enough data or strong pretraining ViT shines; small data may need augmentation, pretraining, or CNN priors.
4. Training & inference notes
Concept: Usually cross-entropy for classification. Compute/memory grows with the number of patch tokens.
Practice: Raising resolution increases tokens—revisit efficient attention, Swin windows, merging, etc., as in Chapter 04.
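A quick back-of-the-envelope shows why resolution bites. The head count and score dtype below are illustrative assumptions (12 heads, fp16), but the quadratic growth of the per-layer attention score matrix is the real effect:

```python
# How raising resolution inflates the attention score matrix (patch 16).
P, heads, bytes_per = 16, 12, 2        # 12 heads, fp16 scores: assumptions
for side in (224, 384, 512):
    n = (side // P) ** 2 + 1           # patch tokens + 1 CLS
    mib = heads * n * n * bytes_per / 2**20
    print(f"{side}px: n={n:5d} tokens, scores ~{mib:6.1f} MiB per layer")
```

Going from 224px to 512px multiplies the score memory by roughly 27× per layer, which is why Swin windows, token merging, and efficient attention show up at high resolution.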
Why it matters
Bridge text Transformers to vision
Reuse encoder/attention/transfer-learning ideas from NLP directly for images—also common as the vision tower in multimodal systems.
Model long-range context globally
Self-attention can relate distant patches without deep stacks of local convolutions—useful when wide context matters (medical/satellite).
Resolution ↔ token count ↔ cost
Attention scales roughly as $O(n^2)$ with token count $n$. Serving cost tracks input resolution, patch size, and kernel efficiency.
Connects to Ch01–04
Same self-attention + positions + FFN; only the input becomes patch embeddings. Chapter 04 memory issues reappear at high-resolution ViT.
How it is used
Training: pretrain + finetune
Start from an ImageNet-pretrained ViT, attach your classifier head, finetune. With little data, use strong augmentation, regularization, or smaller models.
Inference: resolution & batch
Fix input size or use sliding windows. If you hit GPU limits, tune batch/resolution/AMP.
Pick the right backbone
Compare Swin, ConvNeXt, CNN+ViT hybrids—balance data, latency, and accuracy; ViT is not always the answer.
Debug checklist
If metrics are bad: check patch size, positions, CLS, pretrained weights, resolution distribution. If OOM: tokens, efficient attention, checkpointing.
Summary
One-liner — ViT cuts an image into patches, maps each patch to a token, and learns patch relations with a Transformer encoder.
Link to BERT/GPT: Same encoder stack idea; inputs are patch tokens instead of subwords.
vs CNN: Less emphasis on stacked local convolutions; self-attention models global interactions directly (except hybrids).
Ops: pretrain/finetune, resolution & patch size, OOM, and Chapter 04 efficient attention tools.
How to approach problems
ViT items follow image → patch tokens → CLS & PE → encoder, and $N$ patches ⇒ roughly $O(N^2)$ dense attention cost. Patch count is $N = (H/P)(W/P)$ (square patches). OOM: reduce resolution, increase patch size, or use efficient attention from the prior chapter.
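The patch-count rule can be wrapped in a three-line helper (the function name is our own, for illustration):

```python
# Patch-count helper for the rule N = (H/P) * (W/P), square tiling.
def patch_tokens(h: int, w: int, p: int, cls: bool = False) -> int:
    """Number of tokens for an h x w image with p x p patches (+1 if CLS)."""
    assert h % p == 0 and w % p == 0, "image must tile evenly into patches"
    return (h // p) * (w // p) + (1 if cls else 0)

print(patch_tokens(224, 224, 16))             # 196
print(patch_tokens(224, 224, 16, cls=True))   # 197
```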
Example (concept) — “How does ViT turn an image into tokens?”
② patchify → embed → encoder → 2.
Example (calc) — “$224 \times 224$ image, patch $16$, patch tokens (no CLS)?” $(224/16)^2 = 14^2$ → 196.
Example (T/F) — “ViT usually adds positional embeddings after patch embedding.” → 1.
Example (application) — “OOM on high-res.”
① labels only
② larger patches
③ rename optimizer → ②. Answer 2
Patch count — 2 patches per side, no CLS → $2^2 = 4$; 8 patches per side → $8^2 = 64$.
Definition — "ViT’s core is only stacked convolutions without patch tokens." True=1, False=0 → False. Answer 0
T/F — "[CLS] is commonly used with a classification head." → True. Answer 1
Choice — "To reduce tokens/OOM, often:
① increase patch size
② only add labels" → ①. Answer 1
Calc — "image tiling to 4 patches per side → how many patches (square tiling)?" → $4^2 = 16$. Answer 16