Ch.05

Vision Transformer (ViT) and Image Patches

Split the image into a patch grid. Each patch is flattened into a vector and linearly embedded ($z_i = E x_i$). A summary slot (CLS) and positional information for each patch are added, then the encoder and classifier run.

[Diagram] Patchify → Linear embed → Token row → Encoder → Classify: input image → patch grid → patch embedding (flatten → linear map) → summary slot (CLS) + patch tokens P1, P2, P3, … with position info → encoder (attention + feed-forward × layers) → classifier head.

Learning flow at a glance

  1. Patchify: tile the image.
  2. Tokenize: embed each patch and add positions.
  3. Encoder: repeat MHA + FFN blocks.
  4. Classify: read CLS (or pooling) with a head.
In Chapters 01–03 you learned the Transformer encoder: a machine for token sequences. Vision Transformer (ViT) brings the same idea to images. The recipe is simple: cut a photo into small patches, turn each patch into a token vector, add positional information, and run a standard encoder—just like BERT, but the “words” are patches.
Intuition: A CNN slides local filters to build features; ViT “writes the image as a sentence” so patches can look at each other with self-attention.
Math in one line: Flatten patch $i$ to $x_i \in \mathbb{R}^{P^2 \cdot C}$, linearly map $z_i = E x_i$ with $E \in \mathbb{R}^{d \times (P^2 \cdot C)}$, add position embeddings $h_i = z_i + PE_i$, stack encoder blocks. Classification often uses a [CLS] token readout + linear head.
Practice: ViT/Swin/ConvNeXt-style backbones appear in medical/industrial inspection, satellite imagery, scanned documents, and multimodal models. Chapter 04's long-sequence cost (~$N^2$) returns when resolution grows—pair ViT with sensible patch sizes, efficient attention, or hybrid designs.
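Before the formula-by-formula breakdown below, here is a minimal end-to-end sketch of that recipe in PyTorch (all sizes, layer counts, and names are illustrative choices, not prescribed by ViT):

```python
import torch
import torch.nn as nn

# Illustrative sizes (ViT-Base-ish dims, 2 layers instead of 12 for brevity).
B, C, H, W, P, d, n_cls = 2, 3, 224, 224, 16, 768, 10
N = (H // P) * (W // P)                               # 14 * 14 = 196 patch tokens

patchify = nn.Conv2d(C, d, kernel_size=P, stride=P)   # flatten + linear map E
block = nn.TransformerEncoderLayer(d, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=2)  # attention + FFN × layers
cls_tok = nn.Parameter(torch.zeros(1, 1, d))          # summary slot (CLS)
pos = nn.Parameter(torch.zeros(1, N + 1, d))          # learned positions
head = nn.Linear(d, n_cls)                            # classifier head

img = torch.randn(B, C, H, W)
z = patchify(img).flatten(2).transpose(1, 2)                # (B, 196, d) tokens
h = torch.cat([cls_tok.expand(B, -1, -1), z], dim=1) + pos  # CLS + positions
logits = head(encoder(h)[:, 0])                             # CLS readout → (B, 10)
```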

Formulas (ViT flow)

Patch vector & linear embedding
Flatten a patch to $x \in \mathbb{R}^{P^2 \cdot C}$ and apply a linear map $E$ to model dimension $d$.
$z = E x, \quad E \in \mathbb{R}^{d \times (P^2 \cdot C)}$
$z$ is the patch token carrying local appearance in $d$ dimensions.
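In code, the flatten-then-project step is commonly implemented as a strided convolution with kernel and stride equal to the patch size—a sketch (the class name `PatchEmbed` and the sizes are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Equivalent to flattening each P×P×C patch and applying E (d × P²·C).
    def __init__(self, patch=16, in_ch=3, d=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d, kernel_size=patch, stride=patch)

    def forward(self, img):                   # img: (B, C, H, W)
        z = self.proj(img)                    # (B, d, H/P, W/P)
        return z.flatten(2).transpose(1, 2)   # (B, N, d) patch tokens z_i
```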
Add positions
Add positional information to token iii.
$h_i = z_i + PE_i$
So attention knows *where* each patch lived spatially.
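A sketch of the CLS-plus-positions step (sizes and names are illustrative; standard ViT learns both the CLS slot and the positions as parameters):

```python
import torch
import torch.nn as nn

d, n_patches = 768, 196                                # illustrative sizes
cls_tok = nn.Parameter(torch.zeros(1, 1, d))           # learned [CLS] slot
pos = nn.Parameter(torch.zeros(1, n_patches + 1, d))   # learned PE per token

def add_cls_and_pos(z):                      # z: (B, N, d) patch tokens
    c = cls_tok.expand(z.size(0), -1, -1)    # one CLS copy per batch item
    return torch.cat([c, z], dim=1) + pos    # h_i = z_i + PE_i
```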
Self-attention block
With token matrix $H$, compute $Q, K, V$ and
$\mathrm{Attn}(H) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
Same as Chapter 01—only tokens are patches, not words.
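The same formula as a single-head sketch (the weights `Wq, Wk, Wv` are illustrative stand-ins for the learned projections):

```python
import torch
import torch.nn.functional as F

def attn(H, Wq, Wk, Wv):
    # Attn(H) = softmax(Q K^T / sqrt(d_k)) V, single head for clarity
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ V
```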
CLS & classifier head
Take the first token $h_0^{(L)}$ and apply a linear classifier.
$\hat{y} = W_{\mathrm{cls}} h_0^{(L)} + b$
CLS learns to summarize the whole image for the task.
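And the readout as a sketch (sizes illustrative):

```python
import torch.nn as nn

d, num_classes = 768, 10                 # illustrative sizes
head = nn.Linear(d, num_classes)         # W_cls and b

def classify(h_last):                    # h_last: (B, N+1, d) from last block
    return head(h_last[:, 0])            # y_hat = W_cls h_0^(L) + b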

Vision Transformer: Turning Images into Patch Tokens

1. Why patches? Turn pixels into tokens
Concept: Transformers consume ordered tokens $x_1, \ldots, x_N$. ViT tiles the image into patches; each $P \times P$ patch (with $C$ channels) is flattened and linearly projected to dimension $d$.
Intuition: Like numbered puzzle pieces that talk to each other to infer the whole scene.
Practice: Larger patches → fewer tokens but less detail; smaller patches → more tokens → Chapter 04-style cost rises.
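The trade-off is easy to quantify with a small helper (hypothetical, for illustration):

```python
def n_tokens(h, w, p, use_cls=True):
    # Square tiling: (H/p) * (W/p) patch tokens, plus an optional [CLS].
    return (h // p) * (w // p) + int(use_cls)

print(n_tokens(224, 224, 16))   # 197 = 196 patches + CLS
print(n_tokens(224, 224, 32))   # 50  = 49 patches + CLS: cheaper, coarser
```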
2. Patch embedding, CLS, positions
Concept: Add learned positional embeddings so the model knows *where* each patch came from. Image classification often prepends a [CLS] token and reads logits from its final hidden state.
Math habit: Let $h_0$ be CLS and $h_i = z_i + PE_i$ for patch tokens—same picture as BERT.
Practice: Detection/segmentation variants may tokenize differently (queries, pixels).
3. Encoder vs CNN
Concept: A ViT backbone models global patch interactions with self-attention instead of stacking only local convolutions. Hybrid models add a small conv stem.
Intuition: CNNs repeat local windows; standard ViT's dense attention is closer to an "everyone-in-the-room meeting."
Practice: With enough data or strong pretraining ViT shines; small data may need augmentation, pretraining, or CNN priors.
4. Training & inference notes
Concept: Usually cross-entropy for classification. Compute/memory grows with the number of patch tokens.
Practice: Raising resolution increases tokens—revisit efficient attention, Swin windows, merging, etc., as in Chapter 04.
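To see why resolution hurts, compare token counts and the size of the dense attention-score matrix (patch 16; a rough back-of-the-envelope sketch):

```python
# Dense attention materializes ~N^2 scores per head per layer.
for res in (224, 384, 512):
    n = (res // 16) ** 2
    print(f"{res}px -> {n} tokens -> {n * n:,} attention scores")
# 224px -> 196 tokens -> 38,416 attention scores
# 384px -> 576 tokens -> 331,776 attention scores
# 512px -> 1024 tokens -> 1,048,576 attention scores
```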

Why it matters

Bridge text Transformers to vision
Reuse encoder/attention/transfer-learning ideas from NLP directly for images—also common as the vision tower in multimodal systems.
Model long-range context globally
Self-attention can relate distant patches without deep stacks of local convolutions—useful when wide context matters (medical/satellite).
Resolution ↔ token count ↔ cost
Attention scales ~$N^2$ with token count $N$. Serving cost tracks input resolution, patch size, and kernel efficiency.
Connects to Ch01–04
Same self-attention + positions + FFN; only the input becomes patch embeddings. Chapter 04's memory issues reappear in high-resolution ViT.

How it is used

Training: pretrain + finetune
Start from an ImageNet-pretrained ViT, attach your classifier head, finetune. With little data, use strong augmentation, regularization, or smaller models.
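One common way to set this up (a sketch assuming the `timm` library is available; the model name and hyperparameters are illustrative):

```python
import timm
import torch

# ImageNet-pretrained ViT with a fresh classifier head for a 10-class task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```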
Inference: resolution & batch
Fix input size or use sliding windows. If you hit GPU limits, tune batch/resolution/AMP.
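A mixed-precision inference sketch (assumes a CUDA device and a `model`/`images` pair defined elsewhere, e.g. from the finetuning snippet above):

```python
import torch

model.eval()
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(images)   # images: (B, 3, H, W) at the trained resolution
```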
Pick the right backbone
Compare Swin, ConvNeXt, CNN+ViT hybrids—balance data, latency, and accuracy; ViT is not always the answer.
Debug checklist
If metrics are bad: check patch size, positions, CLS, pretrained weights, resolution distribution. If OOM: tokens, efficient attention, checkpointing.

Summary

One-liner — ViT cuts an image into patches, maps each patch to a token, and learns patch relations with a Transformer encoder.
Link to BERT/GPT: Same encoder stack idea; inputs are patch tokens instead of subwords.
vs CNN: Less emphasis on stacked local convolutions; self-attention models global interactions directly (except hybrids).
Ops: pretrain/finetune, resolution & patch size, OOM, and Chapter 04 efficient attention tools.

How to approach problems

ViT items follow image → patch tokens → CLS & PE → encoder, and $N$ patches ⇒ ~$N^2$ dense attention cost. Patch count is $(H/p) \times (W/p)$ (square patches). OOM: reduce resolution, increase patch size, or use efficient attention from the prior chapter.
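The example calculations below can be checked in one line each (sketch):

```python
# Patch count is (H // p) * (W // p) for square tiling.
assert (224 // 16) * (224 // 16) == 196
assert (32 // 16) ** 2 == 4
assert (32 // 8) ** 2 == 16
```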
Example (concept) — “How does ViT turn an image into tokens?”
→ ② patchify → embed → encoder. Answer 2

Example (calc) — “$224 \times 224$, patch $16 \times 16$, patch tokens (no CLS)?” $224/16 = 14$, $14 \times 14 = 196$ → 196.

Example (T/F) — “ViT usually adds positional embeddings after patch embedding.” → True. Answer 1

Example (application) — “OOM on high-res.”
① labels only
② larger patches
③ rename optimizer
→ ②. Answer 2

Patch count — $32 \times 32$, patch $16 \times 16$, no CLS → $(32/16)^2 = 4$; a grid with side 8 patches → $8^2 = 64$.
Definition — "ViT’s core is only stacked convolutions without patch tokens." True=1, False=0 → False. Answer 0

T/F — "[CLS] is commonly used with a classification head." → True. Answer 1

Choice — "To reduce tokens/OOM, often:
① increase patch size
② only add labels"
→ ①. Answer 1

Calc — "$32 \times 32$ image, $8 \times 8$ patches → how many patches (square tiling)?" → $(32/8)^2 = 16$. Answer 16