Ch.06

Swin Transformer: Hierarchical Windows and Global Context

Swin Transformer

Patch Merging

Patches are small tiles; windows set how far attention looks at once. W-MSA keeps attention inside the window; SW-MSA shifts the window so neighbors briefly connect. Merging halves height and width while stacking deeper features.

Split a 4×4 grid into four checkerboard-like 2×2 groups, then concatenate matching positions so channels go C → 4C (preserve information first). After that, apply LayerNorm → Linear to compress key features 4C → 2C. In short: spatial size shrinks while semantic channel depth is retained via compressed stacking.
Concat → LayerNorm → Linear:
H × W × C → H/2 × W/2 × 4C → H/2 × W/2 × 2C
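The shape walk above can be checked numerically. Below is a minimal NumPy sketch of patch merging: the toy sizes are hypothetical, and a fixed random matrix stands in for the learned LayerNorm + Linear projection, just to make the shape change visible.

```python
import numpy as np

# Hypothetical toy sizes: a 4x4 grid of patch features with C=3 channels.
H, W, C = 4, 4, 3
x = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Step 1: gather the four members of each 2x2 neighborhood and
# concatenate along channels (C -> 4C). Nothing is discarded yet.
x0 = x[0::2, 0::2, :]   # top-left of each 2x2 group
x1 = x[1::2, 0::2, :]   # bottom-left
x2 = x[0::2, 1::2, :]   # top-right
x3 = x[1::2, 1::2, :]   # bottom-right
merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)

# Step 2: compress 4C -> 2C. A random projection stands in for the
# trained Linear layer; only the shapes matter in this sketch.
rng = np.random.default_rng(0)
proj = rng.standard_normal((4 * C, 2 * C)).astype(np.float32)
out = merged @ proj  # (H/2, W/2, 2C)

print(merged.shape)  # (2, 2, 12)
print(out.shape)     # (2, 2, 6)
```

Spatial size halves per side (4×4 → 2×2) while channels go 3 → 12 → 6, matching the C → 4C → 2C rule.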
The ViT (Vision Transformer) from Chapter 05 is like a giant town-hall debate where all image pixels discuss everything at once (global attention). Accuracy is strong, but as participants (resolution) increase, memory cost explodes quadratically (O(N²)), a critical weakness.
Swin Transformer (Shifted Window Transformer) solves this elegantly. First, it lets tokens talk only inside small meeting rooms (windows). In the next layer, it slightly moves the partitions (shift) so neighboring groups can exchange information. Then it gradually compresses four workers into one manager (Patch Merging) to build a bigger-picture view. In this chapter, we study how this divide-and-conquer intuition cuts computation so dramatically in math terms.

How to read the key formulas (Swin)

1) W-MSA (small-room rule): If the window side length is M, attention inside one window covers only the M×M tokens in that window. Instead of one global meeting, computation is split into many small rooms.
Patch → W-MSA (local) → Shift → SW-MSA (+ mask) → Patch Merging (token count N ↓)
The diagram shows patch → window attention (W/SW) → patch merging.
2) SW-MSA (moving partitions): In the next block, shift windows by ⌊M/2⌋ and keep only valid query-key links with an attention mask. Neighboring windows can exchange information while invalid links are blocked.
3) Patch Merging (information-preservation view):
① Concatenate the four patches of a 2×2 neighborhood (each with C channels) at one location, giving a temporary 4C (stack first, do not discard).
② Then a Linear layer compresses key information to 2C.
So resolution drops to (H/2, W/2) while channel depth increases to reduce information loss.
4) Why resolution↓ and channels↑? (deeper intuition):
① Information-capacity view
A rough capacity proxy is (H×W)×C. If spatial size becomes 1/4 and channels stay fixed, representational capacity drops too sharply, and fine detail is lost. Increasing channels (C → 2C, after the temporary 4C concat is compressed) buffers this loss.
② Compute-vs-expression balance
Keeping high resolution is expensive; blindly increasing channels is also expensive. Swin reduces token count first (memory/speed gain), then deepens per-token semantics via channels.
③ Context/receptive-field view
Deeper stages should model larger structures and long-range relations. Lower resolution expands effective context; more channels provide room to encode richer semantic factors.
④ Practical one-liner
Only reducing resolution causes "amnesia," only increasing channels causes "compute explosion." Swin uses resolution↓ + channels↑ as an engineering trade-off to control both.

Swin Transformer: where intuition meets equations

1. Window Attention (W-MSA): putting computation on a diet in small rooms
Metaphor: If tens of thousands of people all talk at once (ViT), it's chaos. Instead, split the image into 7×7 meeting rooms (windows), and let attention happen only among tokens in the same room.
Core equation: Let the total patch count be N and the window side length be M. ViT cost grows as O(N²). W-MSA costs O(N·M²). Since M is a fixed small constant (often 7), overall cost scales almost linearly with N, i.e., practical O(N) behavior.
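The near-linear claim is easy to check with plain arithmetic. A small sketch comparing the two cost proxies (the token counts are chosen for illustration; constants and per-head factors are ignored):

```python
def global_cost(n_tokens):
    """Rough proxy for ViT global attention: every token attends to every token."""
    return n_tokens ** 2          # O(N^2)

def window_cost(n_tokens, m=7):
    """Rough proxy for W-MSA: each token attends only within its MxM window."""
    return n_tokens * m * m       # O(N * M^2)

# Token-grid side lengths (e.g. a 224px image with 4px patches gives side 56).
for side in (56, 112, 224):
    n = side * side
    ratio = global_cost(n) // window_cost(n)
    print(side, n, ratio)  # ratio = N / M^2, so it grows linearly with N
```

Doubling the image side quadruples N, which quadruples the savings ratio: the gap between O(N²) and O(N·M²) widens exactly when high resolution matters most.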
2. Shifted Window (SW-MSA): twist partitions to communicate
Metaphor: If everyone talks only to their own room forever, they become siloed. So in the next layer, partitions are shifted by half a window (down-right). Patches that were in neighboring rooms now share one room and exchange information.
Core equation: Mathematically, windows are shifted by ⌊M/2⌋ patches. Boundary fragments are handled cleanly with a cyclic shift plus an attention mask, so invalid interactions are blocked without large extra memory.
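The cyclic shift itself is just a wrap-around roll of the token grid. A toy NumPy sketch (the grid and window size are illustrative; the real implementation also builds the attention mask for the wrapped regions, which is omitted here):

```python
import numpy as np

M = 4                                  # hypothetical window side length
grid = np.arange(8 * 8).reshape(8, 8)  # toy 8x8 grid of token ids

# Cyclic shift by floor(M/2) toward the top-left: rows/columns that fall
# off one edge wrap around to the opposite edge. The wrapped tokens are
# exactly the ones the attention mask must keep from attending across.
shift = M // 2
shifted = np.roll(grid, shift=(-shift, -shift), axis=(0, 1))

print(grid[0, :4].tolist())     # first row before the shift
print(shifted[0, :4].tolist())  # same positions after the wrap-around
```

Because the shift is cyclic, the window count stays the same and no padding memory is needed; the mask does the bookkeeping instead.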
3. Patch Merging: four workers become one manager
Metaphor: As depth increases, fine-grained local opinions are summarized into broader context. Swin groups neighboring 2×2 patches (4 total) into one unit.
Core equation: Resolution halves to (H/2, W/2), while channel width temporarily becomes 4C. A Linear layer compresses it to 2C. So across stages, spatial size shrinks while semantic depth grows, forming a multi-scale pyramid.
4. Two-Successive Blocks: always run as a pair
Concept: Swin blocks work in 2-step combos: (regular window → shifted window).
* z^l = W-MSA(…) : step 1, talk inside your room
* z^{l+1} = SW-MSA(…) : step 2, shift partitions and talk across neighbors
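The pairing can be expressed as a shift schedule: even-indexed blocks use no shift (W-MSA), odd-indexed blocks shift by ⌊M/2⌋ (SW-MSA). A minimal sketch (the helper name is hypothetical):

```python
M = 7  # window side length, as in Swin-T/S/B defaults

def block_shift_sizes(depth):
    """Shift used by each block in a stage: W-MSA blocks shift by 0,
    SW-MSA blocks by floor(M/2); they always alternate in pairs,
    which is why stage depths in Swin are even numbers."""
    return [0 if i % 2 == 0 else M // 2 for i in range(depth)]

print(block_shift_sizes(4))  # [0, 3, 0, 3]
```

Running an odd number of blocks would end a stage on an unshifted window layout, leaving some neighbor pairs never connected, hence the "always a pair" rule.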

Why it matters

An engineering win over heavy O(N²) math
With ViT, once resolution rises (e.g., beyond 256×256), quadratic attention quickly becomes impractical on normal GPUs. Swin constrains this to near-linear O(N) scaling, opening the door to training on 4K driving video and gigapixel pathology slides (WSI) without constant OOM failures.
A paradigm shift for dense prediction
Image classification can already work with ViT. But object detection and semantic segmentation require robust multi-scale pyramids because object sizes vary greatly. Swin brings this pyramid behavior into transformers through patch merging, replacing CNN backbones (e.g., ResNet) in many SOTA systems.
Toward a unified visual backbone
Instead of "ViT for classification, ResNet for detection," Swin enabled one transformer-family backbone to serve classification, detection, and segmentation together, reducing maintenance and research overhead in production.

How it is used

1. Debugging rule #1: keep input size a multiple of 32
The most common real-world Swin error is a shape mismatch. Patch merging is applied repeatedly (2×2), so resolution keeps halving: 1/2 → 1/4 → 1/8 → 1/16 → 1/32. Input height/width should therefore be divisible by 32 (e.g., 224, 256, 512, 1024). Otherwise add zero-padding or resize first.
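A small helper for the divisibility rule. This is a sketch only; the function name is an assumption, not a library API:

```python
import math

def pad_to_multiple(h, w, multiple=32):
    """Return (pad_h, pad_w): the zero-padding needed so that Swin's
    repeated halvings (down to 1/32 of the input) divide cleanly.
    Hypothetical helper for pre-processing, not a framework function."""
    new_h = math.ceil(h / multiple) * multiple
    new_w = math.ceil(w / multiple) * multiple
    return new_h - h, new_w - w

print(pad_to_multiple(500, 375))  # (12, 9): pad to 512x384 before the forward pass
print(pad_to_multiple(224, 224))  # (0, 0): already safe
```

Padding (rather than resizing) preserves aspect ratio and pixel fidelity, at the cost of a few wasted tokens along the border.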
2. In practice, start from fine-tuning—almost always
Training Swin from scratch on a small in-house dataset is usually inefficient and prone to overfitting. Standard workflow: load pretrained backbones (Swin-T, Swin-B) from Hugging Face/MMDetection, replace only the task head, and fine-tune.
3. OOM survival guide
Even with near-linear complexity, Swin is still heavy. If memory crashes:
1) reduce batch size first,
2) lower input resolution (e.g., 1024 → 512),
3) reduce window size M (e.g., 7 → 5).
Bonus: `gradient_checkpointing=True` can save substantial VRAM at a speed cost.

Summary

💡 [Core cheat sheet]
* Operating philosophy: split what can be split (W-MSA), reconnect what is disconnected by shifting partitions (SW-MSA), and aggregate four patches into one as depth grows (Patch Merging).
* Why it matters in vision AI: Swin cures the transformer's high-resolution Achilles heel (from O(N²) toward practical O(N)) while inheriting CNN-style multi-scale pyramids.
* Production mindset: "Use input sizes divisible by 32, start from pretrained weights, and tune window size M plus batch size when memory is tight."

Notes for problem solving

Common Swin problem-solving frame (check these 3 lines first)
1) Is this asking about W-MSA / SW-MSA / Patch Merging?
2) If it is a calculation, identify units first: token count N, window size M, channels C, resolution (H, W).
3) If merging appears, apply the default transform immediately:
* Spatial: (H, W) → (H/2, W/2)
* Channel: C → 4C → 2C

Example (Concept)
"Which is NOT core to Swin?" → fixed full global attention every layer is not core.

Example (True/False)
"Patch merging increases spatial resolution." → False. (It halves H and W while increasing channel depth.)

Example (Scenario)
"High-resolution OOM occurs. What to try first?" → check batch↓, resolution↓, window size (M)↓ in that order.

Quick examples by problem type (same layout as Ch05, Swin-specific content)
Example (Multiple-choice calculation)
"If N = 20, what is the global attention cost N²?" → 20² = 400.

Example (Cumulative merging)
"Resolution after 3 patch-merging operations?" → each step halves the side length, so 1/8: (H/8, W/8).

Example (Configuration/counting)
"Square grid with 8 windows per side: total windows?" → 8² = 64.

Example (Integrated reasoning)
"Input 224×224, patch 4×4, tokens after one merging?"
1) Initial tokens: (224/4)² = 56² = 3136
2) After one merge: 3136/4 = 784
→ Answer: 784
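The same token arithmetic as a reusable check, handy for verifying answers to this problem type (the helper is hypothetical; it assumes each merge quarters the token count, since both sides halve):

```python
def tokens_after_merges(img_side, patch_side, n_merges):
    """Token count after patch embedding plus n patch-merging steps.
    Patch embedding gives a (img_side/patch_side)^2 grid; each merge
    halves the grid side, i.e. divides the token count by 4."""
    side = img_side // patch_side
    return (side * side) // (4 ** n_merges)

print(tokens_after_merges(224, 4, 0))  # 3136 tokens right after embedding
print(tokens_after_merges(224, 4, 1))  # 784 tokens after one merge
```

The same function answers the cumulative-merging example too: three merges divide the count by 4³ = 64.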

One practical tip
For calculation problems, first separate what becomes 1/2 and what becomes 2×. (Spatial shrinks, channel depth grows.)