Chapter 09

Depth (Deep Network)

Having many hidden layers; the 'deep' in deep learning.

Deep learning diagram by chapter

As you complete each chapter, the diagram below fills in. This is the structure so far.

Deep = many hidden layers (middle steps). The “deep” in deep learning is this depth.

[Diagram: a 6-layer network, X → A → B → C → D → Y, with 3 units per layer (X₁–X₃ through Y₁–Y₃); X is Layer 1, A is Layer 2, B is Layer 3, C is Layer 4, D is Layer 5, Y is Layer 6]

More steps mean a deeper network. Deeper networks can learn more refined patterns.

Depth in deep learning

Deep means having many hidden layers (intermediate stages). The 'deep' in deep learning refers exactly to this depth! Each layer applies a linear step (W·input + b) and an activation (ReLU), then passes the result to the next layer.
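The per-layer rule described above can be sketched in a few lines of plain Python. This is a minimal illustration (the function names are ours, not from any library):

```python
def relu(v):
    # ReLU applied elementwise: negatives become 0, positives pass through
    return [max(0, x) for x in v]

def layer(W, b, x):
    # Linear step: each row of W dotted with x, plus the matching bias entry
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # Activation step
    return relu(z)
```

A deep network is then just `layer` applied repeatedly, each output becoming the next input.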

X → A → B → C → … → Y: the more stages, the deeper. Analogy: with 1 stage you can only 'draw a line,' with 10 stages you can 'draw simple shapes,' and with 100 stages you can 'draw a human face.' More depth = more precise, complex patterns.

But deeper isn't always better. Too many layers can cause vanishing gradients (learning signals don't reach early layers) or overfitting (memorizing training data instead of learning general patterns).

More layers enable more complex functions. Each layer's activation adds 'bends,' and stacking layers combines many bends into very complex curves and decision boundaries.
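The 'bends' idea can be seen in a toy 1-D example. Here the weights are chosen by hand purely for illustration: two ReLU units each contribute one bend, and combining them already gives a function no single line can match.

```python
def relu(z):
    # ReLU for a single number: negative values become 0
    return max(0.0, z)

# A toy 1-D network with one hidden layer of two ReLU units.
# Each unit contributes one "bend"; the output combines them.
def f(x):
    h1 = relu(x)        # bend at x = 0
    h2 = relu(x - 1.0)  # bend at x = 1
    return h1 - 2.0 * h2

# The result is piecewise linear with three regimes:
# x < 0 → slope 0, 0 < x < 1 → slope 1, x > 1 → slope -1
```

Stacking a second layer of ReLUs on top of `f` would bend the curve again, which is how depth compounds into very complex shapes.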

In image recognition: layers 1–2 learn 'lines, edges,' layers 3–5 learn 'eyes, noses, wheels,' and layers 6 and up learn 'dogs, cars.' This progression is possible because of depth.

Famous architectures like ResNet and Transformer can be dozens to hundreds of layers deep and still train well. The secret is skip connections (residual connections): gradients can skip layers and flow directly to earlier layers. These techniques overcome the 'limits of depth.'
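The skip-connection idea can be sketched with the same plain-Python style as before. This is a simplified illustration, not a real ResNet block (which uses convolutions and normalization): the key point is only that the input is added back to the layer's output.

```python
def relu(v):
    # Elementwise ReLU
    return [max(0, x) for x in v]

def plain_block(W, b, x):
    # Ordinary layer: ReLU(W·x + b)
    return relu([sum(w * xj for w, xj in zip(row, x)) + bi
                 for row, bi in zip(W, b)])

def residual_block(W, b, x):
    # Skip connection: add the input back to the layer's output,
    # so information (and gradients) can bypass the transformation.
    out = plain_block(W, b, x)
    return [o + xi for o, xi in zip(out, x)]
```

Notice that if the layer's weights are all zero, `residual_block` simply returns its input unchanged, which is exactly why very deep stacks of such blocks remain trainable.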

ChatGPT: models like GPT-4 stack dozens to hundreds of Transformer blocks. Each block deepens the model's understanding of the context, and the final layer generates the answer.

Self-driving cars: Camera images go through deep networks (e.g., ResNet-152, 152 layers!) to accurately distinguish obstacles, lane markings, and signs through many stages. Depth enables handling complex road situations.

Speech recognition & translation: Converting speech to text, or Korean to English, also goes through deep networks where each layer progressively captures 'phonemes → words → context → meaning.'

Example: Input X = [3, 1, 2]. Layer 1: W₁·X + b₁ = [4, -1, 2] (linear), then ReLU gives A = [4, 0, 2]. Layer 2: W₂·A + b₂ = [2, 1, 5], and ReLU gives B = [2, 1, 5]. What if A₂ were left blank?

Solution: The second entry of layer 1 linear output is -1, so ReLU(-1) = 0. So A₂ = 0. For a blank in a middle layer, compute that layer's linear (W·input+b) first, then apply ReLU (negative → 0).

In general: Wherever the blank is, compute all previous layers in order to get that layer's input, then take the dot product of the corresponding row of W with the input, add the bias entry, and apply ReLU to get the answer.

Input X = [3, 1, 2]

Layer 1 (Linear & ReLU):
W₁ = [[1, 0, 1], [0, 1, -1], [1, -1, 0]], b₁ = [-1, 0, 0]
Linear: W₁·X + b₁ = [4, -1, 2] → ReLU → A = [4, 0, 2]

Layer 2 (Linear & ReLU):
W₂ = [[0, 0, 1], [0, 1, 0], [1, 0, 2]], b₂ = [0, 1, -3]
Linear: W₂·A + b₂ = [2, 1, 5] → ReLU → B = [2, 1, 5]

Layer 1: A₁, A₂, A₃ (each row of W₁ · X, plus b₁)

A₁ = (W₁ row 1 · X) + b₁[0] = (1×3 + 0×1 + 1×2) + (-1) = 4 → ReLU(4) = 4
A₂ = (W₁ row 2 · X) + b₁[1] = (0×3 + 1×1 + (-1)×2) + 0 = -1 → ReLU(-1) = 0
A₃ = (W₁ row 3 · X) + b₁[2] = (1×3 + (-1)×1 + 0×2) + 0 = 2 → ReLU(2) = 2

Layer 2: B₁, B₂, B₃ (each row of W₂ · A, plus b₂)

B₁ = (W₂ row 1 · A) + b₂[0] = (0×4 + 0×0 + 1×2) + 0 = 2 → ReLU(2) = 2
B₂ = (W₂ row 2 · A) + b₂[1] = (0×4 + 1×0 + 0×2) + 1 = 1 → ReLU(1) = 1
B₃ = (W₂ row 3 · A) + b₂[2] = (1×4 + 0×0 + 2×2) + (-3) = 5 → ReLU(5) = 5
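The hand calculations above can be reproduced with a short, self-contained Python sketch (the helper names are ours):

```python
def relu(v):
    # Elementwise ReLU: negatives become 0
    return [max(0, x) for x in v]

def forward_layer(W, b, x):
    # Linear step (row of W · x + bias entry), then ReLU
    z = [sum(w * xj for w, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    return relu(z)

X  = [3, 1, 2]
W1 = [[1, 0, 1], [0, 1, -1], [1, -1, 0]]; b1 = [-1, 0, 0]
W2 = [[0, 0, 1], [0, 1, 0], [1, 0, 2]];   b2 = [0, 1, -3]

A = forward_layer(W1, b1, X)   # [4, 0, 2]
B = forward_layer(W2, b2, A)   # [2, 1, 5]
```

Adding more layers is just more calls to `forward_layer`, which is all that 'depth' means computationally.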


Problem

In the multi-layer forward pass (each layer Linear & ReLU), fill in the blank (?).

Input X = [-1, -2]

Layer 1 (Linear & ReLU):
W₁ = [[-2, -2], [0, 0]], b₁ = [-2, -2]
Linear: W₁·X + b₁ = [4, -2] → ReLU → A = [4, 0]

Layer 2 (Linear & ReLU):
W₂ = [[2, 1], [-2, -2]], b₂ = [1, 2]
Linear: W₂·A + b₂ = [9, -6] → ReLU → Y = [9, ?]