Chapter 10
Width (Number of Neurons per Layer)
Having many neurons in a single layer.
Deep learning diagram by chapter
As you complete each chapter, the diagram below fills in. This is the structure so far.
The number of neurons in one layer is the width. Wider layers can handle more features at once.
Width in deep learning
Width refers to how many neurons are in a single layer. More neurons (wider) = the layer can represent more features simultaneously. For example, 1 neuron = 1 feature; 256 neurons = 256 features at once.
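This can be sketched in NumPy (the 3-input, 256-neuron sizes below are made-up toy values): each neuron contributes one row of W and one output value, so the layer's width is the length of its output vector.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # previous layer's output: 3 features

# A wide layer with 256 neurons: W has one row per neuron.
W = rng.normal(size=(256, 3))     # (this layer's width, previous layer's width)
b = np.zeros(256)
h = np.maximum(0, W @ x + b)      # Linear -> ReLU

print(h.shape)                    # (256,): 256 features produced at once
```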
Analogy: if an exam has 1 question, you can only evaluate one skill; with 100 questions, you can assess many abilities at once. Similarly, a wider layer processes more diverse information in one step.
Layers can have different widths. For example, '1 → 2 → 4 → 8' (widening) or '256 → 128 → 64' (narrowing) are both common designs, depending on the purpose.
Depth (number of layers) and width (neurons per layer) together determine the model's total size (parameter count). With the same number of parameters, you can choose 'deep and narrow' or 'shallow and wide'—and this choice significantly affects performance.
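One way to make this trade-off concrete is to count parameters. The helper and the width lists below are illustrative, not from the chapter: each layer with w_in inputs and w_out neurons contributes a (w_out × w_in) weight matrix plus w_out biases.

```python
def param_count(widths):
    """Total parameters of a fully connected net given its layer widths."""
    return sum(w_out * w_in + w_out            # W entries + biases per layer
               for w_in, w_out in zip(widths, widths[1:]))

deep_narrow = [64, 128, 128, 128, 128, 10]     # more layers, fewer neurons each
shallow_wide = [64, 512, 10]                   # few layers, one very wide layer

print(param_count(deep_narrow))    # 59146
print(param_count(shallow_wide))   # 38410
```

Two nets with similar parameter budgets can still behave very differently, which is why this design choice matters.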
Greater width means more features processed simultaneously per layer, but it also increases computation and memory. A layer that is too wide also risks overfitting (memorizing the training data instead of generalizing).
In practice, varying the width within a block is a popular design. The Transformer's feed-forward sublayer keeps the input and output narrow and makes the middle wide (e.g., 768 → 3072 → 768), so the wide layer can extract rich features while everything around it stays compact. ResNet's bottleneck block does the opposite, compressing in the middle (e.g., 256 → 64 → 256) to save computation.
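A minimal NumPy sketch of the narrow → wide → narrow pattern, with toy sizes (real Transformers use something like 768 → 3072 → 768):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                   # toy widths: narrow outside, wide middle

x = rng.normal(size=d_model)
W1 = rng.normal(size=(d_ff, d_model))   # widen: 8 -> 32
W2 = rng.normal(size=(d_model, d_ff))   # narrow back: 32 -> 8

h = np.maximum(0, W1 @ x)               # the wide middle layer extracts features
y = W2 @ h                              # output returns to the original width

print(h.shape, y.shape)                 # (32,) (8,)
```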
Image recognition (CNN): The channel count (number of feature maps) at each layer is its width. Starting from 3 channels (RGB), deeper layers grow to 64 → 128 → 256 → 512 channels, extracting increasingly diverse features.
Chatbots & translators (Transformer): The hidden dimension (e.g., 768, 1024, 4096) is how many numbers each layer processes at once, i.e., its width. Large models such as GPT-4 use hidden dimensions in the thousands, making them very wide.
Recommendation systems: A 'user vector of 256 dimensions' means width 256. It holds 256 features (age, preferences, watch history, etc. transformed into numbers), enabling more detailed recommendations.
Same formula per layer even when widening: Linear (W·input + b) followed by ReLU. Find which layer and neuron the blank belongs to, then compute it from that layer's input using the corresponding row of W and entry of b.
Watch W dimensions: When width changes between layers, W's size changes too. W is (current width × previous width), so take the row of W that corresponds to the blank's neuron, dot it with the previous layer's output, and add the matching entry of b.
Layer by layer: Just like with depth problems, compute previous layers' outputs first before moving to the next. Don't forget ReLU (negative → 0) at each layer.
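The tips above can be sketched with made-up numbers. Suppose the blank is neuron 2 (0-indexed) of a width-4 layer, and the previous layer's output (width 3) has already been computed:

```python
import numpy as np

prev_out = np.array([0.0, 1.5, 2.0])   # previous layer's output (width 3)

W = np.array([[ 1.0, 0.0, -1.0],
              [ 0.5, 1.0,  0.0],
              [-1.0, 2.0,  0.5],       # row 2: the blank neuron's weights
              [ 0.0, 0.0,  1.0]])      # W is (4 x 3): current width x previous width
b = np.array([0.0, 0.0, -1.0, 0.0])

z = W[2] @ prev_out + b[2]             # 0.0 + 3.0 + 1.0 - 1.0 = 3.0
value = max(0.0, z)                    # ReLU: already positive, stays 3.0

print(value)                           # 3.0
```

Only one row of W and one entry of b are needed per blank, which is why identifying the neuron's position comes first.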
Width means having many neurons in one layer. Wider layers can express more features at once; each layer is computed with Linear & ReLU.
Layer 1 (width 2): H = ReLU(W₁·X + b₁)
Layer 2 (width 4): Y = ReLU(W₂·H + b₂)
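These two layers can be evaluated step by step in NumPy; the weights and input below are made-up toy values, not the problem's actual numbers:

```python
import numpy as np

X = np.array([1.0, 2.0])             # input (width 2)

# Layer 1 (width 2): W1 is (2 x 2)
W1 = np.array([[1.0, -1.0],
               [0.5,  0.5]])
b1 = np.array([0.0, 1.0])
H = np.maximum(0, W1 @ X + b1)       # ReLU([-1.0, 2.5]) = [0.0, 2.5]

# Layer 2 (width 4): W2 is (4 x 2) because the width grows from 2 to 4
W2 = np.array([[ 1.0, 0.0],
               [ 0.0, 2.0],
               [-1.0, 1.0],
               [ 1.0, -1.0]])
b2 = np.zeros(4)
Y = np.maximum(0, W2 @ H + b2)       # ReLU([0.0, 5.0, 2.5, -2.5]) = [0.0, 5.0, 2.5, 0.0]

print(H, Y)
```

Note how the negative entries in each pre-activation are zeroed by ReLU before the next layer sees them.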
Problem
In the forward pass where layers get wider (each layer Linear & ReLU), fill in the blank (?).