Ch.01
Weight Initialization
Weight initialization is choosing the initial values of each layer's weights and biases before training. A bad start leads to vanishing or exploding gradients and often makes learning nearly impossible; a good start leads to faster convergence and stable training. This chapter covers the concept of initialization, the intuition and formulas behind Xavier and He initialization, and how they are used in practice.
Weight Initialization: A Good Start Is Half the Battle
What is weight initialization? — Each layer of a neural network has weights W and biases b. Before training, these values are undefined, so we must choose what numbers to use initially. This process is called weight initialization. Intuitively, it is like choosing where to start a marathon: too far back (weights too small) and progress is slow; too far forward (weights too large) and training can explode and diverge.

Mathematically — In one layer the linear combination is z = Wx + b, where x is the input vector, W the weight matrix, and b the bias. If all elements of W are zero, every neuron in that layer gives the same output, symmetry is preserved, and gradients do not spread properly during backprop. So we usually initialize with small random numbers, but the distribution (scale) of those numbers matters: we adjust the variance using the layer's input dimension n_in and output dimension n_out so that activations neither grow nor shrink too much as they pass through.

In practice — With bad initialization, a spam classifier may show almost no decrease in loss, or produce NaNs. In deep CNNs (e.g. medical imaging or fraud detection), skipping Xavier or He often leads to vanishing gradients in the early layers, and training appears stuck. If the scale is too large, gradients explode and training becomes unstable. So in practice, Xavier (for tanh·sigmoid) or He (for ReLU) initialization is standard.
Weight initialization is the first step of training: set W and b for each layer at an appropriate scale so that variance is preserved during forward and backward propagation.
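The symmetry problem can be seen directly. This is a minimal NumPy sketch (layer sizes and the 0.01 scale are illustrative choices, not from the text): with zero weights every neuron in a layer computes the identical output, while small random weights break the tie.

```python
# Sketch: why zero initialization preserves symmetry (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one input vector with 4 features

# Zero init: every neuron computes the same linear sum -> identical outputs.
W_zero = np.zeros((3, 4))
b_zero = np.zeros(3)
z_zero = W_zero @ x + b_zero
print(z_zero)                     # all three neurons identical: [0. 0. 0.]

# Small random init: neuron outputs differ, so their gradients differ
# and the symmetry is broken.
W_rand = rng.normal(scale=0.01, size=(3, 4))
z_rand = W_rand @ x
print(np.allclose(z_rand[0], z_rand[1]))   # distinct values
```

Because the zero-init neurons also receive identical gradients, they stay identical forever; the random-init neurons diverge from the first update.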
Effect of initialization scale
Good initialization sets the scale of W and b so that variance is preserved across layers.
A good start is half the battle — proper initialization → fast convergence · stable training
- ① Initialize: set W, b for each layer (e.g. Xavier/He)
- ② Forward: input → linear sum → activation → next layer
- ③ Loss then backprop: gradients pass through layers
- ④ Update: update W, b from gradients. Good init keeps gradient scale stable
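Steps ①–④ can be sketched on a toy one-layer linear model. This is an illustrative NumPy example under assumed data and sizes (the He-style scale sqrt(2/n_in) stands in for step ①; Xavier would use variance 2/(n_in + n_out)):

```python
# Minimal sketch of steps 1-4 on a one-layer linear model (toy data).
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 1
X = rng.normal(size=(32, n_in))
y = X @ rng.normal(size=(n_in, n_out))          # toy linear target

# 1) Initialize: He-style scale sqrt(2/n_in).
W = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
b = np.zeros(n_out)

lr = 0.1
for step in range(100):
    z = X @ W + b                               # 2) forward: linear sum
    loss = ((z - y) ** 2).mean()                # 3) loss ...
    grad_z = 2 * (z - y) / len(X)
    grad_W = X.T @ grad_z                       # ... and backprop
    grad_b = grad_z.sum(axis=0)
    W -= lr * grad_W                            # 4) update from gradients
    b -= lr * grad_b
print(loss)                                     # decreases toward 0
```

With a reasonable starting scale the loop converges quickly; the same loop started from a wildly wrong scale is where the problems described below appear.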
Why it matters
Vanishing and exploding gradients — In deeper networks, backpropagated gradients are products of many numbers (chain rule). If weights are too small, this product tends to zero (vanishing gradient) and early layers barely update; if too large, it explodes (exploding gradient) and you get NaN or Inf. Good initialization keeps variance stable across layers so that gradients stay at a reasonable scale even in deep networks.

Convergence speed — With proper initialization you start at a good point on the loss surface. A bad starting point can trap you in poor local minima or make convergence very slow. In practice, initialization is tuned together with the learning rate by monitoring validation loss.
How it is used
Xavier (Glorot) initialization — So that the variance of z = Wx + b does not depend on input/output size, W is sampled from a uniform distribution on (-sqrt(6/(n_in + n_out)), sqrt(6/(n_in + n_out))) or a normal distribution with Var(W) = 2/(n_in + n_out). It fits symmetric activations like tanh and sigmoid.

He initialization — ReLU zeros out negative inputs, so output variance is about half the input variance. He initialization uses Var(W) = 2/n_in to compensate. It is the default in modern CNNs and MLPs that use ReLU or Leaky ReLU.

Practical choice — Use He for ReLU-family activations and Xavier for tanh·sigmoid. Frameworks (PyTorch, TensorFlow) usually apply one of these by default depending on the layer type.
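The two rules are short enough to sketch directly. The function names below are my own, not a framework API; note that a uniform distribution on (-limit, limit) has variance limit²/3, which is why the Xavier limit sqrt(6/(n_in + n_out)) yields Var(W) = 2/(n_in + n_out).

```python
# Sketch implementations of the Xavier and He rules (NumPy).
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """U(-limit, limit) with limit = sqrt(6/(n_in+n_out)) -> Var(W) = 2/(n_in+n_out)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """N(0, 2/n_in): compensates for ReLU zeroing half the inputs."""
    return rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W_x = xavier_uniform(400, 200, rng)
W_h = he_normal(400, 200, rng)
print(W_x.var(), 2 / (400 + 200))   # empirical variance close to 2/600
print(W_h.var(), 2 / 400)           # empirical variance close to 2/400
```

In practice you would reach for the framework's built-ins (e.g. `torch.nn.init.xavier_uniform_` and `torch.nn.init.kaiming_normal_` in PyTorch) rather than hand-rolling these.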
Summary
Weight initialization is the process of setting the initial values of each layer's weights and biases before training. Initializing everything to zero keeps neurons symmetric and prevents proper learning; random values that are too large or too small lead to exploding or vanishing activations and gradients. Xavier and He initialization adjust variance based on layer size and are widely used: Xavier for symmetric activations like tanh·sigmoid, He for ReLU-family. A good start reduces vanishing and exploding gradients and makes convergence faster and more stable.
Problem-solving guide
Summary — Weight initialization is the step of choosing initial values for each layer's W and b before training. Zero initialization breaks learning due to symmetry, so we usually use small random numbers and adjust the variance (scale). Xavier uses Var(W) = 2/(n_in + n_out) for tanh·sigmoid; He uses Var(W) = 2/n_in for the ReLU family. Good initialization reduces vanishing/exploding gradients and speeds convergence.
| Type | Solution / example (keyword → answer) |
|---|---|
| Weight init definition | Setting W, b before training. Zero init not recommended. → concept choice 1 |
| Xavier | Var(W) = 2/(n_in + n_out), tanh·sigmoid. → 1 |
| He | Var(W) = 2/n_in, ReLU family. → 2 |
| Vanishing gradient | Gradient nears 0 when weights too small. → 1 |
| Exploding gradient | Gradient explodes when weights too large. → 2 |
| Xavier uniform | Range ±sqrt(6/(n_in + n_out)). Use integer when computing. |
| Layer size | n_in = 400, n_out = 200 → limit = sqrt(6/600) = 0.1 |
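The formula rows above can be checked numerically. A quick sketch (the layer sizes are illustrative):

```python
# Quick check of the Xavier/He formulas for one pair of layer sizes.
import numpy as np

n_in, n_out = 400, 200
xavier_var = 2.0 / (n_in + n_out)        # Xavier: Var(W) = 2/(n_in + n_out)
he_var = 2.0 / n_in                      # He:     Var(W) = 2/n_in
limit = np.sqrt(6.0 / (n_in + n_out))    # Xavier uniform limit
print(xavier_var, he_var, limit)         # ~0.00333, 0.005, 0.1
```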
Example (definition)
"What is the main purpose of weight initialization?
① Match layer scale before training
② Increase learning rate
③ Data augmentation"
Purpose is to keep activation and gradient scale stable across layers. → Answer 1
Example (Xavier vs He)
"Common initialization for layers using ReLU?
① Xavier
② He
③ Zero"
He is used for ReLU-family. → Answer 2
Example (calculation)
When n_in = 400 and n_out = 200, what is 1/limit (an integer) for the Xavier uniform limit sqrt(6/(n_in + n_out))?
limit = sqrt(6/600) = sqrt(0.01) = 0.1, so 1/limit = 10. → Answer 10
True/False example — "Weight initialization is the process of setting W, b before training." → True. Answer 1
Scenario example — "When loss barely decreases in a spam classifier, what to check first?
① Initialization·learning rate
② Data size only
③ Batch size only" → Check initialization·learning rate first. Answer 1
Choice example — "In He initialization, Var(W) is?
① 2/n_in
② 1/n_in
③ 2/(n_in + n_out)" → He uses Var(W) = 2/n_in. Answer 1
Concept example — "In Xavier, if n_in = n_out = 3, the integer value of the uniform limit sqrt(6/(n_in + n_out)) is?
① 1
② 2
③ 3" → sqrt(6/6) = 1. Answer 1