Everyone's AI

Ch.07

Vision Models: Local CNN vs Global ViT

CNNs slide a kernel to build local features; ViT turns patches into tokens and applies global self-attention. Window attention is a common way to split that cost.

CNN vs ViT — background photo: Unsplash

Learning flow

CNN reads neighborhoods like a moving magnifier; ViT runs a town-hall meeting among patches.

ViT slices an image into patches (like tokens / words) and runs dense self-attention so, even in early blocks, every patch can relate to every other patch—global mixing across the frame. CNNs slide small kernels with weight sharing; each layer typically sees only a local neighborhood while the receptive field widens indirectly as depth grows (not the whole image at once in a single conv).
Hierarchical window attention sits between the two: window attention + shift + merge approximates global context hierarchically, without paying the full $\mathcal{O}(N^2)$ price of swallowing global context in one bite as the token count $N$ grows.
This chapter lines up CNN vs ViT on local vs global, inductive bias vs flexibility, and compute / memory / data scale so you can choose backbones with intent.

How to read the key formulas (CNN vs ViT)

1) CNN local mixing: At location $i$, one neuron mixes a small neighborhood with a kernel $W$:
$$y_i = \sum_{\delta \in \mathcal{N}_k} W_\delta \, x_{i+\delta}$$
(up to padding/boundary details). Shared $W$ across locations is the CNN hallmark.
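As a minimal sketch of this formula (the helper name `local_mix` and the averaging kernel are ours, purely for illustration), sliding one shared 3×3 kernel over a 5×5 input looks like:

```python
import numpy as np

def local_mix(x, W):
    """Apply one shared k-by-k kernel W at every valid location of 2D input x.

    This is y_i = sum_delta W_delta * x_{i+delta} (cross-correlation, no
    padding); the same W is reused at every location (weight sharing).
    """
    k = W.shape[0]
    H, W_in = x.shape
    out = np.zeros((H - k + 1, W_in - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * x[i:i + k, j:j + k])  # mix a local patch only
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
W = np.ones((3, 3)) / 9.0       # a simple averaging kernel
y = local_mix(x, W)
print(y.shape)                  # (3, 3): each output sees only a 3x3 neighborhood
```

Note that no output element ever sees the whole input; that is exactly the locality the chapter contrasts with ViT.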
The diagram contrasts CNN receptive-field growth with one-block global mixing in ViT.
2) ViT global mixing in one block: Build $Q, K, V$ from patch tokens and apply
$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The $QK^\top$ term is naturally token×token, so cost tends to scale as $\mathcal{O}(N^2)$ as $N$ grows (the usual self-attention cost picture).
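A minimal single-head sketch of this block (the names `attn` and `Wq/Wk/Wv` are ours; real ViTs add multiple heads, output projections, and residual connections):

```python
import numpy as np

def attn(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V over N patch tokens, single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N): every token vs every token
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return w @ V                              # (N, d): globally mixed tokens

rng = np.random.default_rng(0)
N, d = 196, 64                  # 224x224 input, 16x16 patches -> 14*14 = 196 tokens
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attn(X, Wq, Wk, Wv)
print(out.shape)                # (196, 64); the scores matrix alone was 196 x 196
```

The `(N, N)` `scores` array is exactly the term that makes dense attention $\mathcal{O}(N^2)$.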
3) Two ways to grow "how far you see":
① CNN: stack layers / pooling / stride so the receptive field widens indirectly.
② ViT (dense): compare all positions in one attention block for direct long-range mixing (with higher cost).
4) Hierarchical windows to split the $N^2$ pain:
① Cost: large $N$ makes ViT's $N^2$ painful first.
② Compromise: window attention costs roughly $\mathcal{O}(N M^2)$, shifts mix information across window boundaries, and merging shrinks $N$.
③ One-liner: not "global in one gulp," but compose small globals hierarchically.
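A back-of-envelope comparison of the two cost pictures (the 7×7 window and 56×56 grid are illustrative assumptions in the spirit of Swin-style designs; constants are ignored):

```python
# Rough score-matrix cost: dense attention vs window attention.
def dense_cost(N):
    return N * N                 # one N x N score matrix

def window_cost(N, M):
    # N / M^2 windows, each with an (M^2 x M^2) score matrix -> N * M^2 total
    return N * M * M

N = 56 * 56                      # 3136 tokens on a 56x56 feature grid
M = 7                            # 7x7 windows

print(dense_cost(N))             # 9834496 score entries
print(window_cost(N, M))         # 153664 -> 64x cheaper at this setting
```

Same idea in one line: windows trade one giant $N \times N$ meeting for many small $M^2 \times M^2$ meetings, and shifting plus merging recovers global context over depth.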

CNN and ViT: two grammars for the same pixels

1. CNN: slide a magnifying glass (locality)
Intuition: A kernel looks only at a small $k \times k$ neighborhood. The same weights slide across locations (weight sharing), which bakes in strong priors like approximate translation equivariance.
One-line math: Each output location mixes a local patch with learned weights (convolution / cross-correlation up to padding). Going deeper grows the receptive field indirectly.
2. ViT: patches as words, global debate in one block
Intuition: Patches become tokens. Dense self-attention lets every token attend to every other token in one block—powerful for long-range structure, but the score matrix grows like $N \times N$.
Scale reminder: Attention often scales as $\mathcal{O}(N^2 d)$ up to constants (heads, projections, implementation details).
3. A hierarchical compromise: not swallowing "global" in one bite
Connection: Attend inside windows, shift windows to mix across boundaries, and merge tokens to reduce $N$. That buys global context hierarchically instead of one giant global meeting every layer.
4. Why compare CNN vs ViT in one chapter?
Data & resolution: CNNs can be strong with modest data; ViTs often shine with large-scale pretraining—but costs rise with $N$.
Tasks: Detection/segmentation still lean on multi-scale designs (pyramids, hierarchical windows, ConvNeXt, FPN).
Bridge: Efficient attention and window/merge hierarchies are the engineering knobs behind the same $N^2$ anxiety.
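The "going deeper grows the receptive field indirectly" claim in point 1 can be made concrete with the standard receptive-field recursion (the helper name is ours):

```python
# Standard receptive-field recursion for a plain conv stack:
#   rf_l = rf_{l-1} + (k_l - 1) * jump_{l-1},   jump_l = jump_{l-1} * s_l
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 5))   # 11: five 3x3 stride-1 convs see only 11 px
print(receptive_field([(3, 2)] * 5))   # 63: stride-2 downsampling widens RF fast
```

This is the quantitative version of "debugging always returns to how far one layer can look": stacking and striding grow context gradually, while one dense attention block sees everything at once.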

Why it matters

Model choice is a table of assumptions, not a leaderboard row
CNNs encode locality + sharing + hierarchy—strong inductive biases that can stabilize small-data regimes. ViTs relax locality in exchange for flexible global mixing, often trading in more data and compute.
Hybrids are the new normal
Modern stacks mix patch embeddings + CNN stems, window attention + conv downsampling, and ConvNeXt-style depthwise conv blocks. Still, debugging always returns to how far one layer can look.
What you take away from this chapter
This chapter places CNNs (locality, receptive field, weight sharing) next to ViTs (patch tokens, dense self-attention, $N$ and $N^2$) on the same axis. After it, you can reason in plain language about tokens / patches / resolution when only ViTs OOM, and about inductive bias vs data when choosing a first baseline on small labels.
The goal here is not memorizing trendy backbone names, but carrying a decision frame: how much local vs global you buy for your task, GPU budget, and label scale.

How it is used

1. Where CNNs still rule: on-device, edge, and real-time
Because they are light and fast, CNNs dominate workloads that must run immediately on the device without round-tripping a heavy cloud model.
* Phone camera filters & face unlock (Face ID): You need low battery and ~0.1s to find facial contours—local cues—so lightweight CNNs like MobileNet are the default.
* Factory conveyor defect detection: When dozens of parts pass every second, real-time scratch/dent checks fit CNN detectors (e.g., YOLO family) extremely well.
* Dashcams & low-latency ADAS: If a pedestrian jumps out, ~10ms perception-to-brake favors a fast CNN stack over a transformer that might stall under tight latency.
2. Where ViTs shine: huge AI, generative models, multimodal
ViTs sit where teams can spend serious server/GPU budget and need deep, holistic understanding. They share the transformer recipe with LLMs, so pairing text + vision is natural.
* ChatGPT vision (GPT-4o / GPT-4V-style): Upload a receipt photo and it reads/summarizes—language and pixels live in one transformer space. A ViT-style image encoder is the “eyes.”
* Midjourney, DALL·E, etc.: Prompt like *“an astronaut cat smoking a Marlboro”* flows through diffusion + transformer backbones (e.g., DiT) where ViT-like tokens keep global composition coherent.
* Medical & satellite analytics: Metastasis patterns or terrain change over a wide tile often need macro context beyond a single pixel—global mixing helps.
3. Production hybrids (CNN + ViT): the practical sweet spot
Most product teams blend both, not pick a purity trophy.
* Pattern: Feed high-res frames through CNN early stages to downsample quickly and extract edges/textures, then attach ViT/transformer blocks deeper to reason about long-range context on a smaller token grid.
* Examples: Google CoAtNet and Apple MobileViT popularized this recipe to ship transformer benefits on mobile-class hardware.

Summary

CNNs stack local kernels with weight sharing, widening the receptive field layer by layer so long-range context is built indirectly. ViTs turn images into patch tokens and use dense self-attention for global mixing in a block, but cost tends to blow up as $N^2$ when $N$ grows.
Window attention with shift/merge is a common hierarchical way to split that $N^2$ pain. When picking a backbone, weigh data & pretraining scale, resolution / patch size / $N$, whether you need dense prediction (detection/segmentation), and latency / GPU memory together.

Notes for problem solving

Start by reading like this
- First classify the axis: CNN locality & receptive field / ViT patches & global attention / data & inductive bias
- For calculations, fix units: tokens $N$, input $H \times W$, patch $P$ → grid $\approx (H/P)(W/P)$
- When you see $N^2$, think dense token×token score matrices
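The grid rule above in two lines (the helper name is ours):

```python
def num_tokens(H, W, P):
    """Patch-token count for an H x W image with P x P patches (grid only)."""
    return (H // P) * (W // P)

print(num_tokens(224, 224, 16))   # 196 (the classic ViT setting)
print(num_tokens(448, 448, 16))   # 784: 2x resolution -> 4x N -> 16x the N^2 cost
```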

Example (concept)
"In CNNs, applying the same kernel at every spatial location is best described as?"
① global attention
② weight sharing
③ patch merging
④ CLS token
Answer 2
Why? Weight sharing is the CNN default—different axis from ViT patch tokens.

Example (T/F)
"A typical first CNN conv directly sees the whole image"
Answer: false
Why? First layers are local; wide context emerges deeper.

Example (calc)
"If $N=10$, what is $N^2$?" → 100
Example (scenario)
"ViT OOM at high resolution—first move?"
① increase batch only
② change tokens (resolution/patch/batch)
③ add only global blocks
Answer 2
Why? Reduce memory pressure directly.

Example (numeric MC)
"$224 \times 224$ input, $16 \times 16$ patches: patch token count (grid only)?"
① 14
② 196
③ 224
④ 3136
Answer 2 (value 196)
Why? $(224/16)^2 = 14^2 = 196$.

Example (concept pick)
"Small labels, need a quick stable baseline?"
① huge ViT scratch
② small CNN + augmentation
③ train with no data
Answer 2
Why? Locality bias stabilizes learning.

Example (integrated)
"Same aug/res, only ViT OOMs—next check?"
① optimizer
② patch/sequence/checkpointing
③ log encoding
Answer 2