Ch.14

Master CNNs: Kernels, Stride, Padding & Backbone Evolution

Ch.12

Basics → modern backbones

Picture features flowing left to right—conv, pooling, then residual adds.

Kernels sweep the image to build features, pooling summarizes them, and classification follows. In deep blocks, the residual branch F(x) and the skip path x are added to give y.

Input → Conv → Pool → FC → Output
The CNN is one of the key inventions that taught AI to see images. This chapter unpacks the core of CNNs: how kernels scan an image, how stride sets the step size, and how padding protects the border.
You will see how these elementary blocks assemble into architectures like ResNet, whose deep layers still train stably, and into modern designs like ConvNeXt, covering the full story from basics to backbone.

How to read the formulas (kernel, stride, padding)

1. Output side length
For input $I$, kernel $K$, padding $P$, stride $S$:
$$O = \frac{I - K + 2P}{S} + 1$$
2. Stride & padding
With a larger stride, neighboring windows overlap less, so feature maps shrink faster. Padding reduces under-sampling at the borders.
3. Standard conv parameters (bias included)
$$(K \times K \times C_{in} + 1) \times C_{out}$$
4. Residuals & modern backbones
Skips like $y = F(x) + x$ stabilize very deep CNNs; ResNet, ConvNeXt, and friends build on this idea.
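The skip connection in item 4 can be sketched in a few lines of plain Python. This is only an illustration: `residual_block` and `branch` are made-up names, and a real block would use convolutions for F rather than an arbitrary function on a flat list.

```python
def residual_block(x, branch):
    """Return y = F(x) + x, where `branch` plays the role of F.

    `x` is a flat list of floats; `branch` must return a list of the
    same length so the element-wise addition is well defined.
    """
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [f + xi for f, xi in zip(fx, x)]


# If the branch contributes nothing (outputs zeros), the block reduces
# to an identity mapping: the input passes through the skip unchanged.
def zero_branch(x):
    return [0.0 for _ in x]


print(residual_block([1.0, -2.0, 3.0], zero_branch))  # [1.0, -2.0, 3.0]
```

One common reading of this design is that each block only has to learn a correction on top of the identity, which is part of why very deep stacks remain trainable.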

Modern CNN architectures: from conv basics to ResNet

1. Kernel / filter: a magnifying-glass stamp for features
Idea: A kernel is usually a small tile of numbers, such as a $3\times3$ or $5\times5$ matrix. It slides across the raw image; where it overlaps, it multiplies aligned pixels and sums them (convolution) to compute whether there is a horizontal edge, a sharp color boundary, and so on.
Intuition: Like *Where’s Waldo?*—you do not scan the whole poster at once; you sweep with a small magnifier, patch by patch. When the magnifier (kernel) matches what you hunt (e.g. red stripes), it cries “found it!” and stamps a high activation there.
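The multiply-and-sum sweep can be written out directly. The sketch below assumes stride 1 and no padding, and uses a hand-picked horizontal-edge kernel (`edge_kernel` is an illustrative choice, not a learned filter). Like deep learning frameworks, it does not flip the kernel, so strictly it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    Both arguments are lists of rows. At each position the overlapping
    values are multiplied element-wise and summed.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# A 5x5 image: dark top two rows (0), bright bottom three rows (1).
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
# Illustrative horizontal-edge kernel: bottom row minus top row.
edge_kernel = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]
print(conv2d_valid(image, edge_kernel))
# [[3, 3, 3], [3, 3, 3], [0, 0, 0]]
```

The response is high wherever the window straddles the dark-to-bright boundary and drops to zero once the window sits entirely inside the uniform bright region.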
2. Stride and padding: step size and a safety cushion
Stride: The step size, i.e. how many pixels the magnifier (kernel) moves at once. A stride of 1 visits every position; a stride of 2 skips every other one. A larger stride covers the same-sized input in fewer steps, so total compute drops, but the width and height of the feature map shrink sooner.
Padding: Stacked convolutions visit border pixels less often than the center, so edge information is under-represented, and without padding resolution keeps dropping. We pad the border with zero pixels—zero padding—like a cushion, to hold size and let edges get stamped fairly.
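To make the claim about border pixels concrete, the sketch below counts how many kernel windows touch each pixel, with and without zero padding. The helper name `visit_counts` is made up for this illustration; it assumes a square input and a square kernel.

```python
def visit_counts(size, kernel, stride=1, padding=0):
    """Count how many kernel windows cover each pixel of a size x size input.

    Padded positions are skipped; only real pixels are counted.
    """
    counts = [[0] * size for _ in range(size)]
    padded = size + 2 * padding
    # Top-left corner of every window, in padded coordinates.
    starts = range(0, padded - kernel + 1, stride)
    for top in starts:
        for left in starts:
            for i in range(kernel):
                for j in range(kernel):
                    r = top + i - padding   # back to input coordinates
                    c = left + j - padding
                    if 0 <= r < size and 0 <= c < size:
                        counts[r][c] += 1
    return counts


no_pad = visit_counts(5, 3, padding=0)
with_pad = visit_counts(5, 3, padding=1)
print(no_pad[0][0], no_pad[2][2])      # 1 9  (corner vs. center, no padding)
print(with_pad[0][0], with_pad[2][2])  # 4 9  (corner vs. center, padding 1)
```

Without padding, a corner pixel of a 5×5 input is seen once while the center is seen nine times; padding of 1 raises the corner to four visits and leaves the center unchanged.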
3. Pooling: compress to essentials
Idea: From many convolution responses, we mostly keep the strongest values and greatly reduce spatial size. Usually we take the max in each window—max pooling (sometimes average pooling too).
Intuition: In a noisy meeting, you ignore faint chatter (background, weak activations) and only record the loudest, clearest voice (strong features) in the minutes—less data, but the cues you need for decisions remain.
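A minimal max-pooling sketch in plain Python, assuming a 2×2 window with stride 2 by default and no padding. `max_pool2d` here is a local helper written for this example, not a library call.

```python
def max_pool2d(fmap, window=2, stride=2):
    """Keep the largest value in each window x window patch."""
    out_h = (len(fmap) - window) // stride + 1
    out_w = (len(fmap[0]) - window) // stride + 1
    return [
        [
            max(
                fmap[r * stride + i][c * stride + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool2d(fmap))  # [[4, 2], [2, 8]]
```

The 4×4 map becomes 2×2: sixteen values are reduced to four, and each survivor is the strongest activation in its patch.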
4. Channels and 3D convolution: flaky pastry–style stacks
Idea: Grayscale uses one channel; RGB stacks R, G, B like three thin films—three channels. So the kernel is not a flat tile but a 3D block as deep as the input, reading one location at a time across channels. As layers deepen, width and height shrink while filters (output channels) that encode different patterns stack—64, 128, 256…—like layered flaky pastry getting thicker.
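The "3D block" idea can be shown at a single location: one filter, as deep as the input, produces one number by summing over channels as well as rows and columns. The helper `conv_at` and the "more red than blue" filter are invented for illustration.

```python
def conv_at(patch, kernel):
    """Response of one filter at one location.

    `patch` and `kernel` are both indexed [channel][row][col] and have
    identical shape; the result is a single number summed over
    channels, rows, and columns.
    """
    return sum(
        patch[ch][i][j] * kernel[ch][i][j]
        for ch in range(len(kernel))
        for i in range(len(kernel[0]))
        for j in range(len(kernel[0][0]))
    )


# A 2x2 RGB patch (3 channels) and one illustrative 3x2x2 filter that
# responds to "more red than blue".
patch = [
    [[9, 9], [9, 9]],   # R
    [[5, 5], [5, 5]],   # G
    [[1, 1], [1, 1]],   # B
]
kernel = [
    [[1, 1], [1, 1]],      # +R
    [[0, 0], [0, 0]],      # ignore G
    [[-1, -1], [-1, -1]],  # -B
]
print(conv_at(patch, kernel))  # 32
```

Each such filter yields one output channel, so a layer with 64 filters turns a 3-channel input into a 64-channel feature map.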

Why it matters

Parameter sharing (maximizing efficiency)
Fully connected layers would give different weights to every pixel, so parameters explode as resolution grows. A CNN reuses one kernel (magnifier) across the whole map—like hiring a cleaner per floor tile versus one person with one broom for the whole room. That cuts trainable parameters and FLOPs together, so vision models can still run on smartphones and live CCTV under tight budgets.
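A rough parameter count makes the gap visible. The sizes below (a 224×224 RGB input, 64 outputs or filters) are illustrative assumptions, and the two layers do not produce the same output shape, so treat this as a scaling comparison rather than a like-for-like benchmark.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit."""
    return (n_inputs + 1) * n_outputs


def conv_params(kernel, c_in, c_out):
    """(K*K*C_in + 1) * C_out, independent of image resolution."""
    return (kernel * kernel * c_in + 1) * c_out


# Illustrative sizes: a 224x224 RGB image mapped to 64 outputs/filters.
fc = dense_params(224 * 224 * 3, 64)
conv = conv_params(3, 3, 64)
print(fc)    # 9633856
print(conv)  # 1792
```

Doubling the image side would roughly quadruple the dense count, while the convolution count stays at 1792 because the same kernel is reused at every position.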
Translation invariance
Whether the dog sits nicely in the center or is tucked in the corner, the CNN still seeks the same “dog” cues. It does not memorize fixed coordinates; the kernel scans the whole image for the pattern. Strictly speaking, convolution alone is translation-equivariant: shift the input and the response shifts with it. Pooling then absorbs small pixel jitter or shifts, which gives approximate invariance and makes recognition robust in practice. (This still does not mean full invariance to rotation and scale.)
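A one-dimensional toy case shows both effects: the response peak moves with the pattern, while a maximum taken over all positions does not change. The signals and the `[1, 2, 1]` kernel are made up for this sketch, and the global max stands in for pooling in its most extreme form.

```python
def correlate1d(signal, kernel):
    """Valid 1D sliding dot product (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[p + i] * kernel[i] for i in range(k))
        for p in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 1]                # illustrative "bump" detector
left = [0, 1, 2, 1, 0, 0, 0, 0]   # pattern near the left edge
right = [0, 0, 0, 0, 1, 2, 1, 0]  # same pattern, shifted 3 steps right

resp_left = correlate1d(left, kernel)
resp_right = correlate1d(right, kernel)

# The peak position moves with the pattern...
print(resp_left.index(max(resp_left)), resp_right.index(max(resp_right)))  # 1 4
# ...but the maximum over all positions is identical.
print(max(resp_left), max(resp_right))  # 6 6
```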
Hierarchical features (part to whole)
Shallow layers catch lines, corners, and coarse texture; deeper layers chain them into circles and contours, then into parts like a dog’s eyes or car wheels. It is like building Lego from small bricks. The same backbone extends from classification to detection (bounding boxes) and segmentation, and loosely parallels hierarchical processing in the visual cortex.
The common language of vision—and core literacy
In PyTorch and TensorFlow, `Conv2d` is a module you use constantly. Even when transformers like ViT (Ch. 05) are popular, convolutions still appear in stems and hybrid models when images are tiled or local features are extracted. ImageNet pretrained weights often seed papers and products. Mastering CNN grammar—how images become numbers—lets you read, modify, and adapt open-source code and new papers to your project.

How it is used

Step 1: Feature extraction — clarifying muddy water
The image runs through a conveyor of [Conv (find features) → activation (e.g. ReLU, drop negatives) → (optional) BatchNorm → pooling (summarize)] several times. Early layers filter coarse grit (simple edges and colors); later layers strip finer noise and keep more abstract shapes. Background clutter falls away; feature maps show which salient features exist where.
Step 2: Flatten — lining up Lego bricks
After conv blocks, the stacked 3D feature map is unrolled into one long vector per image—like disassembling a 3D Lego build and laying bricks in a single row. The batch dimension (batch size) stays; only per-image channel and spatial axes are concatenated so fully connected layers can emit class scores.
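The unrolling step, sketched on nested Python lists. The `[N][C][H][W]` layout is an assumption chosen for this example, and `flatten_batch` is a local helper rather than a framework function.

```python
def flatten_batch(batch):
    """Turn [N][C][H][W] nested lists into [N][C*H*W].

    The outer (batch) axis is preserved; channel, row, and column axes
    are unrolled in that order.
    """
    return [
        [value for channel in sample for row in channel for value in row]
        for sample in batch
    ]


# Two images, each with 2 channels of 2x2 values.
batch = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    [[[9, 10], [11, 12]], [[13, 14], [15, 16]]],
]
flat = flatten_batch(batch)
print(len(flat), len(flat[0]))  # 2 8
print(flat[0])                  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The batch of two stays a batch of two; only the 2×2×2 block inside each image becomes a vector of length 8.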
Step 3: Fully connected — the judge’s verdict
The vector passes through Dense layers that output a logit per class. For multi-class tasks you usually apply softmax for probabilities and train with cross-entropy. “This photo is 95% cat” is a human-readable wrapper around this stage’s output.
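The path from logits to probabilities and loss can be sketched with the standard library alone. The three logits and the class names in the comment are hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])


logits = [4.0, 1.0, 0.5]  # hypothetical scores for (cat, dog, bird)
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(round(cross_entropy(logits, target=0), 3))
```

The loss is small when the correct class already has high probability and grows as that probability falls, which is what pushes the logits in the right direction during training.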
Step 4: Transfer learning and fine-tuning
You often do not train a hundred-layer CNN from scratch. You load ResNet, EfficientNet, and other pretrained backbones (“premium skeletons”) trained on millions of images. With little data, freeze the backbone and replace only the classification head (the judge) for your task, then retrain. Input size (e.g. $224 \times 224$) and RGB normalization (e.g. ImageNet mean/std) must match pretraining. With more data, unfreeze upper blocks and fine-tune. Even ViT / hybrid setups still lean on convolution extracting local structure.
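The normalization step might look like the sketch below. The mean and std are the channel statistics commonly quoted for ImageNet-pretrained models with inputs scaled to [0, 1]; this is an assumption, so check the values your specific pretrained weights expect. `normalize_pixel` is an illustrative helper, not a library function.

```python
# Commonly used ImageNet channel statistics for inputs scaled to [0, 1].
# Assumption: verify against the documentation of the weights you load.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel near the dataset's mean color lands close to (0, 0, 0).
print(normalize_pixel((124, 116, 104)))
# Pure white lands well above zero in every channel.
print(normalize_pixel((255, 255, 255)))
```

Feeding a pretrained backbone inputs that skip this step, or that use different statistics, shifts every activation away from what the weights were trained on.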

Summary

One-liner: A CNN finds patterns with kernels, moves with stride, protects borders with padding, and summarizes with pooling.
Connection: Those building blocks evolve into residual connections (ResNet) and modern CNNs like ConvNeXt—deep stacks without losing signal.
Practice tip: The most common debugging pain is shape mismatch. Track output size layer by layer and watch spatial size vs. channels—that habit levels up your code fast.
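One way to build that habit is a tiny shape tracer. The layer format below is a made-up mini notation for this sketch, not any framework's API, and the division is rounded down, which matches the default behavior of PyTorch's `Conv2d` and `MaxPool2d`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length, with the division rounded down."""
    return (size - kernel + 2 * padding) // stride + 1


def trace_shapes(in_shape, layers):
    """Return the (C, H, W) shape after each layer.

    `layers` is a list of dicts in a made-up mini format:
      {"type": "conv", "k": 3, "s": 1, "p": 1, "out": 16}
      {"type": "pool", "k": 2, "s": 2}
    """
    c, h, w = in_shape
    shapes = []
    for layer in layers:
        k, s, p = layer["k"], layer.get("s", 1), layer.get("p", 0)
        h, w = conv_out(h, k, s, p), conv_out(w, k, s, p)
        if layer["type"] == "conv":
            c = layer["out"]  # pooling leaves the channel count unchanged
        shapes.append((c, h, w))
    return shapes


layers = [
    {"type": "conv", "k": 3, "s": 1, "p": 1, "out": 16},
    {"type": "pool", "k": 2, "s": 2},
    {"type": "conv", "k": 3, "s": 1, "p": 1, "out": 32},
    {"type": "pool", "k": 2, "s": 2},
]
for shape in trace_shapes((3, 32, 32), layers):
    print(shape)
# (16, 32, 32)
# (16, 16, 16)
# (32, 16, 16)
# (32, 8, 8)
```

The last shape tells you the flattened vector has 32 × 8 × 8 = 2048 entries, which is the input size the first fully connected layer must accept.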

Problem-solving notes

CNN exam questions usually hinge on output feature-map size and parameter counts. Nail those two patterns and you can dissect even large modern architectures.
Key formula 1: output side length
With input size $I$, kernel $K$, padding $P$, stride $S$, the output side length $O$ is
$$O = \frac{I - K + 2P}{S} + 1$$
Example: $32\times32$ input, $3\times3$ kernel, padding 1, stride 1 gives $O = \frac{32 - 3 + 2(1)}{1} + 1 = 32$; size stays the same because padding compensates for the shrinkage.
Key formula 2: parameters (with bias)
For square $K\times K$ kernels, input channels $C_{in}$, output channels $C_{out}$, one bias per output channel:
$$(K \times K \times C_{in} + 1) \times C_{out}$$
Example: $3\times3$, $C_{in}=3$ (RGB), $C_{out}=16$: $(3\cdot3\cdot3+1)\cdot16=448$ parameters.
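Both key formulas fit in two short functions, which is a convenient way to check answers. The function names are illustrative. Two details go beyond the formulas as written and are assumptions of this sketch: the division is rounded down when it is not exact, and a `bias=False` switch drops the +1 term.

```python
def output_side(i, k, p, s):
    """O = (I - K + 2P) / S + 1, with the division rounded down."""
    return (i - k + 2 * p) // s + 1


def conv_param_count(k, c_in, c_out, bias=True):
    """(K*K*C_in + 1) * C_out; the +1 is dropped when there is no bias."""
    return (k * k * c_in + (1 if bias else 0)) * c_out


print(output_side(32, 3, 1, 1))                # 32
print(output_side(32, 3, 1, 2))                # 16
print(conv_param_count(3, 3, 16))              # 448
print(conv_param_count(3, 3, 16, bias=False))  # 432
```

The second line previews the stride drill: moving from stride 1 to stride 2 takes the side length from 32 to 16.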
Concept drill
“Which is closest to why we use padding?
① increase compute for fun
② reduce border loss and keep size
③ change color space”
→ Padding cushions borders so edges are not under-sampled. → Answer 2

Concept drill 2
“If stride goes from 1 to 2, the spatial size of the feature map usually…”
→ Steps skip more, so height/width roughly halve.