Everyone's AI
Ch.06

Swin Transformer: Hierarchical Windows and Global Context

Swin Transformer

Patch Merging

Patches are small tiles; windows set how far attention looks at once. W keeps attention inside the window; SW shifts the window so neighbors briefly connect. Merging halves height and width while stacking deeper features.

Split a $4\times4$ grid into four checkerboard-like $2\times2$ groups, then concatenate matching positions to get a temporary $C\rightarrow4C$ (preserve information first). After that, apply LayerNorm → Linear to compress key features $4C\rightarrow2C$. In short: spatial size shrinks, semantic channel depth is retained via compressed stacking.
Concat → LayerNorm → Linear

Input patches (before merging)

$H \times W \times C$

Concat → LayerNorm

$\frac{H}{2} \times \frac{W}{2} \times 4C$

LayerNorm → Linear

$\frac{H}{2} \times \frac{W}{2} \times 2C$
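The Concat → LayerNorm → Linear steps can be sketched in plain Python on nested lists. This is a minimal illustration, not a reference implementation: the function names, the concatenation order of the four neighbors, the parameter-free `layer_norm`, and the hand-picked `weight` matrix are all assumptions made for the example (a real model learns the weights).

```python
def patch_merge_concat(x):
    """x has shape [H][W][C] (nested lists); returns shape [H/2][W/2][4C]."""
    H, W = len(x), len(x[0])
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    merged = []
    for i in range(0, H, 2):
        row = []
        for j in range(0, W, 2):
            # Concatenate the four patches of one 2x2 neighborhood along channels.
            # The order used here is an assumption; any fixed order works.
            row.append(x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1])
        merged.append(row)
    return merged


def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / (var + eps) ** 0.5 for v in vec]


def linear(vec, weight):
    """Matrix-vector product; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


# 4x4 grid with C = 1; each cell stores its own index so the grouping is visible.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
merged = patch_merge_concat(grid)                      # shape 2 x 2 x 4  (4C)

# Hypothetical 4C -> 2C projection, chosen by hand for illustration only.
weight = [[0.25, 0.25, 0.25, 0.25], [0.5, -0.5, 0.5, -0.5]]
reduced = [[linear(layer_norm(tok), weight) for tok in row] for row in merged]

print(len(merged), len(merged[0]), len(merged[0][0]))     # 2 2 4
print(len(reduced), len(reduced[0]), len(reduced[0][0]))  # 2 2 2
print(merged[0][0])                                       # [0.0, 4.0, 1.0, 5.0]
```

The printed token `[0.0, 4.0, 1.0, 5.0]` shows that nothing is discarded at the concat step: all four neighbors survive side by side before the Linear layer compresses them.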

The ViT (Vision Transformer) from Chapter 05 is like a giant town-hall debate where all image patches discuss everything at once (global attention). Accuracy is strong, but as the number of participants (resolution) increases, compute and memory cost grow quadratically, $\mathcal{O}(N^2)$, which is a critical weakness.
Swin Transformer (Shifted Window Transformer) solves this elegantly. First, it lets tokens talk only inside small meeting rooms (windows). In the next layer, it slightly moves the partitions (shift) so neighboring groups can exchange information. Then it gradually compresses four workers into one manager (Patch Merging) to build a bigger-picture view. In this chapter, we study how this divide-and-conquer intuition cuts computation so dramatically in math terms.

How to read the key formulas (Swin)

1) W-MSA (small-room rule): If the window side length is $M$, each window holds $M\times M$ tokens, so every token attends to only about $M^2$ others. Instead of one global meeting, computation is split into many small rooms.
Diagram: patch → W-MSA (local) → shift → SW-MSA (+ mask) → patch merging (token count $N$ ↓).
2) SW-MSA (moving partitions): In the next block, shift windows by $\lfloor M/2 \rfloor$ and keep only valid query-key links with an attention mask. Neighboring windows can exchange information while invalid links are blocked.
3) Patch Merging (information-preservation view):
① Concatenate the four patches of each $2\times2$ neighborhood (each with $C$ channels), giving a temporary $4C$ (stack first, do not discard).
② Then a Linear layer compresses the key information to $2C$.
So resolution drops to $\left(\frac{H}{2},\frac{W}{2}\right)$ while channel depth increases to reduce information loss.
4) Why resolution↓ and channels↑? (deeper intuition):
① Information-capacity view
A rough capacity proxy is $(H\times W)\times C$. If the spatial size drops to 1/4 and channels stay fixed, representational capacity drops too sharply, and fine detail is lost. Doubling the channels ($C\rightarrow4C\rightarrow2C$) buffers this loss.
② Compute-vs-expression balance
Keeping high resolution is expensive; blindly increasing channels is also expensive. Swin reduces token count first (memory/speed gain), then deepens per-token semantics via channels.
③ Context/receptive-field view
Deeper stages should model larger structures and long-range relations. Lower resolution expands effective context; more channels provide room to encode richer semantic factors.
④ Practical one-liner
Only reducing resolution causes "amnesia," only increasing channels causes "compute explosion." Swin uses resolution↓ + channels↑ as an engineering trade-off to control both.
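The information-capacity view in ① can be checked with one line of arithmetic. Using the rough proxy $(H\times W)\times C$ from the text (a proxy only, not a formal measure of information), compare one merging step with channels held fixed against the same step with channels doubled:

```latex
\underbrace{\frac{H}{2}\cdot\frac{W}{2}\cdot C}_{\text{channels fixed}}
  = \frac{1}{4}\,HWC
\qquad\text{vs.}\qquad
\underbrace{\frac{H}{2}\cdot\frac{W}{2}\cdot 2C}_{\text{after patch merging}}
  = \frac{1}{2}\,HWC
```

Under this proxy, each merging step halves capacity instead of quartering it, while the token count still falls by a factor of four.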

Swin Transformer: where intuition meets equations

1. Window Attention (W-MSA): putting computation on a diet in small rooms
Metaphor: If tens of thousands of people all talk at once (ViT), it's chaos. Instead, split the image into 7x7-size meeting rooms (windows), and let attention happen only among tokens in the same room.
Core equation: Let the total patch count be $N$ and the window side length be $M$. ViT cost grows as $\mathcal{O}(N^2)$. W-MSA becomes $\mathcal{O}(N \cdot M^2)$. Since $M$ is a fixed small constant (often 7), overall cost scales almost linearly with $N$, i.e., practical $\mathcal{O}(N)$ behavior.
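To make the gap between $N^2$ and $N \cdot M^2$ concrete, the sketch below counts query-key pairs only. It ignores the channel dimension and the projection layers, so treat it as an order-of-magnitude comparison; the function names and the grid sizes are chosen for illustration.

```python
def global_attention_pairs(n_tokens):
    """Every token attends to every token: N^2 query-key pairs."""
    return n_tokens * n_tokens


def window_attention_pairs(n_tokens, window_side):
    """Every token attends only to the M*M tokens in its window: N * M^2 pairs."""
    return n_tokens * window_side * window_side


M = 7  # window side length mentioned in the text
for grid_side in (56, 112, 224):  # illustrative token grids
    n = grid_side * grid_side
    g = global_attention_pairs(n)
    w = window_attention_pairs(n, M)
    print(f"N={n:>6}  global={g:>13}  windowed={w:>9}  ratio={g // w}")
```

The ratio equals $N / M^2$, so it grows with resolution: quadrupling the token count quadruples the windowed cost but multiplies the global cost by sixteen.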
2. Shifted Window (SW-MSA): twist partitions to communicate
Metaphor: If everyone talks only to their own room forever, they become siloed. So in the next layer, partitions are shifted by half a window (down-right). Patches that were in neighboring rooms now share one room and exchange information.
Core equation: Mathematically, windows are shifted by $\lfloor \frac{M}{2} \rfloor$ patches. Boundary fragments are handled cleanly with cyclic shift plus an attention mask, so invalid interactions are blocked without large extra memory.
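The cyclic shift and the mask can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions: the helper names, the 8×8 grid, and the labelling scheme are made up for the example. The idea is to roll the map by $\lfloor M/2 \rfloor$ toward the top-left, label each position by the region it belongs to, and allow attention only between tokens that share a label.

```python
def cyclic_shift(x, s):
    """Roll a 2-D grid up and left by s positions, wrapping around the edges."""
    H, W = len(x), len(x[0])
    return [[x[(r + s) % H][(c + s) % W] for c in range(W)] for r in range(H)]


def region_labels(H, W, M, s):
    """Label each position of the shifted map; wrapped-around strips get their own label."""
    def band(i, size):
        if i < size - M:
            return 0          # belongs to an untouched window
        if i < size - s:
            return 1          # last window, part that did not wrap
        return 2              # last window, part that wrapped around
    return [[3 * band(r, H) + band(c, W) for c in range(W)] for r in range(H)]


def window_mask(labels, M, win_row, win_col):
    """Boolean M^2 x M^2 mask for one window: True where query and key may interact."""
    ids = [labels[r][c]
           for r in range(win_row * M, (win_row + 1) * M)
           for c in range(win_col * M, (win_col + 1) * M)]
    return [[q == k for k in ids] for q in ids]


def allowed_links(mask):
    return sum(sum(row) for row in mask)


H = W = 8
M = 4
s = M // 2
feature = [[r * W + c for c in range(W)] for r in range(H)]
shifted = cyclic_shift(feature, s)
labels = region_labels(H, W, M, s)

print(shifted[0][0])                                   # 18: original position (2, 2)
print(allowed_links(window_mask(labels, M, 0, 0)))     # 256: interior window, nothing blocked
print(allowed_links(window_mask(labels, M, 1, 1)))     # 64: corner window holds four unrelated regions
```

In an actual attention layer, the blocked (False) entries would receive a large negative value before the softmax so they contribute essentially zero weight. The window count stays the same as in the unshifted layout, which is why the extra memory cost is small.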
3. Patch Merging: four workers become one manager
Metaphor: As depth increases, fine-grained local opinions are summarized into broader context. Swin groups neighboring $2 \times 2$ patches (4 total) into one unit.
Core equation: Resolution halves to $\left(\frac{H}{2}, \frac{W}{2}\right)$, while channel width temporarily becomes $4C$. A Linear layer compresses it to $2C$. So across stages, spatial size shrinks while semantic depth grows: a multi-scale pyramid.
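Applying the merging rule repeatedly gives the multi-scale pyramid. The sketch below tracks only shapes. The defaults are assumptions for illustration: a $4\times4$ patch, four stages, and an initial channel width of 96 (the value used by the smallest published Swin variant); the function name is hypothetical.

```python
def swin_stage_shapes(height, width, patch=4, embed_dim=96, num_stages=4):
    """Return (H, W, C) of the token map at each stage.

    Assumes one patch-embedding step followed by a patch merging
    (H/2, W/2, 2C) between consecutive stages.
    """
    h, w, c = height // patch, width // patch, embed_dim
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        h, w, c = h // 2, w // 2, c * 2   # spatial halves, channels double
        shapes.append((h, w, c))
    return shapes


for stage, shape in enumerate(swin_stage_shapes(224, 224), start=1):
    print(f"stage {stage}: {shape}")
# stage 1: (56, 56, 96)
# stage 2: (28, 28, 192)
# stage 3: (14, 14, 384)
# stage 4: (7, 7, 768)
```

Under these assumptions the total downsampling factor is $4 \times 2^3 = 32$, which is where a "divisible by 32" input-size rule of thumb comes from.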
4. Two-Successive Blocks: always run as a pair
Concept: Swin blocks work in 2-step combos: (regular window → shifted window).
* $z^l = \text{W-MSA}(\dots)$ : step 1, talk inside your room
* $z^{l+1} = \text{SW-MSA}(\dots)$ : step 2, shift partitions and talk across neighbors
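One simple way to express the pairing in code is a per-block shift schedule: even-indexed blocks use no shift (W-MSA) and odd-indexed blocks shift by $\lfloor M/2 \rfloor$ (SW-MSA). The helper below is a hypothetical sketch of that alternation, not a specific library API.

```python
def shift_schedule(depth, window_side):
    """Return (block_type, shift) for each block in a stage of the given depth."""
    schedule = []
    for i in range(depth):
        if i % 2 == 0:
            schedule.append(("W-MSA", 0))                   # step 1: regular windows
        else:
            schedule.append(("SW-MSA", window_side // 2))   # step 2: shifted windows
    return schedule


print(shift_schedule(depth=2, window_side=7))
# [('W-MSA', 0), ('SW-MSA', 3)]
```

Because the blocks come in pairs, a stage depth is normally chosen to be even so that every W-MSA block is followed by its SW-MSA partner.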

Why it matters

An engineering win over heavy $O(N^2)$ math
With ViT, once resolution rises (e.g., beyond 256x256), quadratic attention quickly becomes impractical on normal GPUs. Swin constrains this to near-linear $O(N)$ scaling, which makes high-resolution inputs such as 4K driving video and gigapixel pathology slides (WSI) far more tractable, with fewer OOM failures.
A paradigm shift for dense prediction
Image classification can already work with ViT. But object detection and semantic segmentation require robust multi-scale pyramids because object sizes vary greatly. Swin brings this pyramid behavior into transformers through patch merging, replacing CNN backbones (e.g., ResNet) in many SOTA systems.
Toward a unified visual backbone
Instead of "ViT for classification, ResNet for detection," Swin enabled one transformer-family backbone to serve classification, detection, and segmentation together, reducing maintenance and research overhead in production.

How it is used

1. Autonomous driving and smart CCTV (king of dense prediction)
A 4K driving camera mixes tiny distant pedestrians (few pixels) and huge nearby trucks (many pixels) in one frame. Thanks to patch merging (pyramidal hierarchy), Swin can separate boundaries at pixel level across both tiny and large objects in segmentation-heavy pipelines. This is why it is a core backbone in next-generation driving vision stacks.
2. Gigapixel-scale medical and satellite imaging
Pathology slides (WSI) and Earth-observation satellite imagery can reach tens of thousands of pixels per side. Many setups that were impractical with older dense ViT due to memory can be handled by Swin's window-based processing. It captures local cell morphology (inside windows) while preserving global tissue/terrain context through patch merging.
3. "Magic" image restoration (SwinIR)
Swin is widely used in super-resolution and restoration: turning blocky low-res images into sharp 4K outputs, or denoising old photographs. It refines local detail by referencing nearby pixels inside windows, then uses shifted windows to align texture with global scene context so outputs look natural.
4. The ecosystem's "universal heart" (backbone)
In production, Swin is often used as a replaceable system component rather than a standalone model. Teams swap classic backbones (e.g., ResNet) inside detection/segmentation frameworks such as Mask R-CNN or YOLO with Swin to boost end-to-end performance significantly.

Summary

💡 [Core cheat sheet]
* Operating philosophy: split what can be split (W-MSA), reconnect what is disconnected by shifting partitions (SW-MSA), and aggregate four patches into one as depth grows (Patch Merging).
* Why it matters in vision AI: Swin cures the transformer's high-resolution Achilles heel (from $O(N^2)$ toward practical $O(N)$) while inheriting CNN-style multi-scale pyramids.
* Production mindset: "Use input sizes divisible by 32, start from pretrained weights, and tune window size $M$ plus batch when memory is tight."
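The "divisible by 32" rule can be applied with a small helper that rounds each side up to the next multiple, for example to decide how much padding an input needs. The function name is hypothetical, and the default of 32 assumes a $4\times4$ patch followed by three $2\times$ merging steps.

```python
def round_up_to_multiple(size, multiple=32):
    """Smallest value >= size that is divisible by `multiple`."""
    return -(-size // multiple) * multiple  # ceiling division, then scale back


for side in (224, 250, 1080, 1920):
    target = round_up_to_multiple(side)
    print(f"{side} -> {target} (pad {target - side})")
# 224 -> 224 (pad 0)
# 250 -> 256 (pad 32 - 26 = 6)
# 1080 -> 1088 (pad 8)
# 1920 -> 1920 (pad 0)
```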

Notes for problem solving

Common Swin problem-solving frame (check these 3 lines first)
1) Is this asking about W-MSA / SW-MSA / Patch Merging?
2) If it is a calculation, identify units first: token count $N$, window size $M$, channel $C$, resolution $(H,W)$.
3) If merging appears, apply the default transform immediately:
* Spatial: $(H,W)\rightarrow\left(\frac{H}{2},\frac{W}{2}\right)$
* Channel: $C\rightarrow4C\rightarrow2C$

Example (Concept)
"Which is NOT core to Swin?" → fixed full global attention every layer is not core.

Example (True/False)
"Patch merging increases spatial resolution." → false. Answer 0

Example (Scenario)
"High-resolution OOM occurs. What to try first?" → check batch↓, resolution↓, window size ($M$)↓ in order.
Quick examples by problem type (same layout as Ch05, Swin-specific content)
Example (Multiple-choice calculation)
"If $N=20$, what is global attention $N^2$?" → $20^2=400$.
Example (Cumulative merging)
"Resolution after 3 patch-merging operations?" → each step halves side length, so $1/8$: $\left(\frac{H}{8},\frac{W}{8}\right)$.
Example (Configuration/counting)
"Square grid with 8 windows per side: total windows?" → $8^2=64$.
Example (Integrated reasoning)
"Input $224\times224$, patch $4\times4$, tokens after one merging?"
1) Initial tokens: $(224/4)^2=56^2=3136$
2) After one merge: $3136/4=784$
→ Answer 784
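The integrated-reasoning example can be double-checked with a few lines of Python. The helper name is hypothetical; it assumes a square input and that each merging step halves the side length of the token grid.

```python
def tokens_after_merges(image_side, patch_side, num_merges):
    """Token count for a square image after patch embedding and k merging steps."""
    side = image_side // patch_side      # token-grid side after patch embedding
    for _ in range(num_merges):
        side //= 2                       # each merge halves the side length
    return side * side


print(tokens_after_merges(224, 4, 0))   # 3136
print(tokens_after_merges(224, 4, 1))   # 784
print(tokens_after_merges(224, 4, 3))   # 49
```

Each merge divides the token count by 4, so three merges divide it by 64 ($3136/64=49$), matching the $1/8$ side-length rule from the cumulative-merging example.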
One practical tip
For calculation problems, first separate what becomes 1/2 and what becomes 2x. (Spatial shrinks, channel depth grows.)