Ch.05

Vision Transformer (ViT) and Image Patches

Split the image into a patch grid. Each patch is flattened into a vector and linearly embedded ($z_i = E x_i$). A summary slot (CLS) and positional information for each patch are added, then the encoder and classifier run.

[Diagram] Patchify → Linear embed → Token row → Encoder → Classify: input image → patch grid → patch embedding (flatten → linear map) → summary slot (CLS) + patch tokens P1, P2, P3, … with position info → encoder (attention + feed-forward × layers) → classifier head.

Learning flow at a glance

  1. Patchify: tile the image.
  2. Tokenize: embed each patch and add positions.
  3. Encoder: repeat MHA + FFN blocks.
  4. Classify: read CLS (or pooling) with a head.
In Chapters 01–03 you learned the Transformer encoder: a machine for token sequences. Vision Transformer (ViT) brings the same idea to images. The recipe is simple: cut a photo into small patches, turn each patch into a token vector, add positional information, and run a standard encoder—just like BERT, but the “words” are patches.
Intuition: A CNN slides local filters to build features; ViT “writes the image as a sentence” so patches can look at each other with self-attention.
Math in one line: Flatten patch $i$ to $x_i \in \mathbb{R}^{P^2 \cdot C}$, linearly map $z_i = E x_i$ with $E \in \mathbb{R}^{d \times (P^2 \cdot C)}$, add position embeddings $h_i = z_i + PE_i$, stack encoder blocks. Classification often uses a [CLS] token readout + linear head.
Practice: ViT/Swin/ConvNeXt-style backbones appear in medical/industrial inspection, satellite imagery, scanned documents, and multimodal models. Chapter 04's long-sequence cost (~$N^2$) returns when resolution grows—pair ViT with sensible patch sizes, efficient attention, or hybrid designs.
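Before the formula-by-formula breakdown below, here is a minimal end-to-end sketch of that recipe in PyTorch (all sizes, layer counts, and names are illustrative choices, not prescribed by ViT):

```python
import torch
import torch.nn as nn

# Illustrative sizes (ViT-Base-ish dims, 2 layers instead of 12 for brevity).
B, C, H, W, P, d, n_cls = 2, 3, 224, 224, 16, 768, 10
N = (H // P) * (W // P)                               # 14 * 14 = 196 patch tokens

patchify = nn.Conv2d(C, d, kernel_size=P, stride=P)   # flatten + linear map E
block = nn.TransformerEncoderLayer(d, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=2)  # attention + FFN × layers
cls_tok = nn.Parameter(torch.zeros(1, 1, d))          # summary slot (CLS)
pos = nn.Parameter(torch.zeros(1, N + 1, d))          # learned positions
head = nn.Linear(d, n_cls)                            # classifier head

img = torch.randn(B, C, H, W)
z = patchify(img).flatten(2).transpose(1, 2)                # (B, 196, d) tokens
h = torch.cat([cls_tok.expand(B, -1, -1), z], dim=1) + pos  # CLS + positions
logits = head(encoder(h)[:, 0])                             # CLS readout → (B, 10)
```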

Formulas (ViT flow)

Patch vector & linear embedding
Flatten a patch to $x \in \mathbb{R}^{P^2 \cdot C}$ and apply a linear map $E$ to model dimension $d$.
$z = E x, \quad E \in \mathbb{R}^{d \times (P^2 \cdot C)}$
$z$ is the patch token carrying local appearance in $d$ dimensions.
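In code, the flatten-then-project step is commonly implemented as a strided convolution with kernel and stride equal to the patch size—a sketch (the class name `PatchEmbed` and the sizes are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Equivalent to flattening each P×P×C patch and applying E (d × P²·C).
    def __init__(self, patch=16, in_ch=3, d=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d, kernel_size=patch, stride=patch)

    def forward(self, img):                   # img: (B, C, H, W)
        z = self.proj(img)                    # (B, d, H/P, W/P)
        return z.flatten(2).transpose(1, 2)   # (B, N, d) patch tokens z_i
```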
Add positions
Add positional information to token iii.
$h_i = z_i + PE_i$
So attention knows *where* each patch lived spatially.
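A sketch of the CLS-plus-positions step (sizes and names are illustrative; standard ViT learns both the CLS slot and the positions as parameters):

```python
import torch
import torch.nn as nn

d, n_patches = 768, 196                                # illustrative sizes
cls_tok = nn.Parameter(torch.zeros(1, 1, d))           # learned [CLS] slot
pos = nn.Parameter(torch.zeros(1, n_patches + 1, d))   # learned PE per token

def add_cls_and_pos(z):                      # z: (B, N, d) patch tokens
    c = cls_tok.expand(z.size(0), -1, -1)    # one CLS copy per batch item
    return torch.cat([c, z], dim=1) + pos    # h_i = z_i + PE_i
```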
Self-attention block
With token matrix $H$, compute $Q, K, V$ and
$\mathrm{Attn}(H) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
Same as Chapter 01—only tokens are patches, not words.
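The same formula as a single-head sketch (the weights `Wq, Wk, Wv` are illustrative stand-ins for the learned projections):

```python
import torch
import torch.nn.functional as F

def attn(H, Wq, Wk, Wv):
    # Attn(H) = softmax(Q K^T / sqrt(d_k)) V, single head for clarity
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ V
```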
CLS & classifier head
Take the first token $h_0^{(L)}$ and apply a linear classifier.
$\hat{y} = W_{\mathrm{cls}} h_0^{(L)} + b$
CLS learns to summarize the whole image for the task.
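And the readout as a sketch (sizes illustrative):

```python
import torch.nn as nn

d, num_classes = 768, 10                 # illustrative sizes
head = nn.Linear(d, num_classes)         # W_cls and b

def classify(h_last):                    # h_last: (B, N+1, d) from last block
    return head(h_last[:, 0])            # y_hat = W_cls h_0^(L) + b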

Vision Transformer: Turning Images into Patch Tokens

1. Why patches? Turn pixels into tokens
Concept: Transformers consume ordered tokens $x_1, \ldots, x_N$. ViT tiles the image into patches; each $P \times P$ patch (with $C$ channels) is flattened and linearly projected to dimension $d$.
Intuition: Like numbered puzzle pieces that talk to each other to infer the whole scene.
Practice: Larger patches → fewer tokens but less detail; smaller patches → more tokens → Chapter 04-style cost rises.
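The trade-off is easy to quantify with a small helper (hypothetical, for illustration):

```python
def n_tokens(h, w, p, use_cls=True):
    # Square tiling: (H/p) * (W/p) patch tokens, plus an optional [CLS].
    return (h // p) * (w // p) + int(use_cls)

print(n_tokens(224, 224, 16))   # 197 = 196 patches + CLS
print(n_tokens(224, 224, 32))   # 50  = 49 patches + CLS: cheaper, coarser
```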
2. Patch embedding, CLS, positions
Concept: Add learned positional embeddings so the model knows *where* each patch came from. Image classification often prepends a [CLS] token and reads logits from its final hidden state.
Math habit: Let $h_0$ be CLS and $h_i = z_i + PE_i$ for patch tokens—same picture as BERT.
Practice: Detection/segmentation variants may tokenize differently (queries, pixels).
3. Encoder vs CNN
Concept: A ViT backbone models global patch interactions with self-attention instead of stacking only local convolutions. Hybrid models add a small conv stem.
Intuition: CNNs repeat local windows; standard ViT's dense attention is closer to an "everyone-in-the-room meeting."
Practice: With enough data or strong pretraining ViT shines; small data may need augmentation, pretraining, or CNN priors.
4. Training & inference notes
Concept: Usually cross-entropy for classification. Compute/memory grows with the number of patch tokens.
Practice: Raising resolution increases tokens—revisit efficient attention, Swin windows, merging, etc., as in Chapter 04.
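To see why resolution hurts, compare token counts and the size of the dense attention-score matrix (patch 16; a rough back-of-the-envelope sketch):

```python
# Dense attention materializes ~N^2 scores per head per layer.
for res in (224, 384, 512):
    n = (res // 16) ** 2
    print(f"{res}px -> {n} tokens -> {n * n:,} attention scores")
# 224px -> 196 tokens -> 38,416 attention scores
# 384px -> 576 tokens -> 331,776 attention scores
# 512px -> 1024 tokens -> 1,048,576 attention scores
```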

Why it matters

Bridge text Transformers to vision
Reuse encoder/attention/transfer-learning ideas from NLP directly for images—also common as the vision tower in multimodal systems.
Model long-range context globally
Self-attention can relate distant patches without deep stacks of local convolutions—useful when wide context matters (medical/satellite).
Resolution ↔ token count ↔ cost
Attention scales ~$N^2$ with token count $N$. Serving cost tracks input resolution, patch size, and kernel efficiency.
Connects to Ch01–04
Same self-attention + positions + FFN; only the input becomes patch embeddings. Chapter 04's memory issues reappear in high-resolution ViT.

How it is used

Training: pretrain + finetune
Start from an ImageNet-pretrained ViT, attach your classifier head, finetune. With little data, use strong augmentation, regularization, or smaller models.
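One common way to set this up (a sketch assuming the `timm` library is available; the model name and hyperparameters are illustrative):

```python
import timm
import torch

# ImageNet-pretrained ViT with a fresh classifier head for a 10-class task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```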
Inference: resolution & batch
Fix input size or use sliding windows. If you hit GPU limits, tune batch/resolution/AMP.
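A mixed-precision inference sketch (assumes a CUDA device and a `model`/`images` pair defined elsewhere, e.g. from the finetuning snippet above):

```python
import torch

model.eval()
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(images)   # images: (B, 3, H, W) at the trained resolution
```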
Pick the right backbone
Compare Swin, ConvNeXt, CNN+ViT hybrids—balance data, latency, and accuracy; ViT is not always the answer.
Debug checklist
If metrics are bad: check patch size, positions, CLS, pretrained weights, resolution distribution. If OOM: tokens, efficient attention, checkpointing.

Summary

One-liner — ViT cuts an image into patches, maps each patch to a token, and learns patch relations with a Transformer encoder.
Link to BERT/GPT: Same encoder stack idea; inputs are patch tokens instead of subwords.
vs CNN: Less emphasis on stacked local convolutions; self-attention models global interactions directly (except hybrids).
Ops: pretrain/finetune, resolution & patch size, OOM, and Chapter 04 efficient attention tools.

How to approach problems

ViT items follow image → patch tokens → CLS & PE → encoder, and $N$ patches ⇒ ~$N^2$ dense attention cost. Patch count is $(H/p) \times (W/p)$ (square patches). OOM: reduce resolution, increase patch size, or use efficient attention from the prior chapter.
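The example calculations below can be checked in one line each (sketch):

```python
# Patch count is (H // p) * (W // p) for square tiling.
assert (224 // 16) * (224 // 16) == 196
assert (32 // 16) ** 2 == 4
assert (32 // 8) ** 2 == 16
```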
Example (concept) — “How does ViT turn an image into tokens?”
→ ② patchify → embed → encoder. Answer 2

Example (calc) — “$224 \times 224$, patch $16 \times 16$, patch tokens (no CLS)?” $224/16 = 14$, $14 \times 14 = 196$ → 196.

Example (T/F) — “ViT usually adds positional embeddings after patch embedding.” → True. Answer 1

Example (application) — “OOM on high-res.”
① labels only
② larger patches
③ rename optimizer
→ ②. Answer 2

Patch count — $32 \times 32$, patch $16 \times 16$, no CLS → $(32/16)^2 = 4$; a grid with side 8 patches → $8^2 = 64$.
Definition — "ViT’s core is only stacked convolutions without patch tokens." True=1, False=0 → False. Answer 0

T/F — "[CLS] is commonly used with a classification head." → True. Answer 1

Choice — "To reduce tokens/OOM, often:
① increase patch size
② only add labels"
→ ①. Answer 1

Calc — "$32 \times 32$ image, $8 \times 8$ patches → how many patches (square tiling)?" → $(32/8)^2 = 16$. Answer 16