Ch.01

Weight Initialization

Weight initialization is the process of choosing the initial values of each layer's weights and biases before training. A bad start leads to vanishing or exploding gradients and often makes learning nearly impossible; a good start leads to faster convergence and stable training. This chapter covers the concept of initialization, the intuition and formulas behind Xavier and He initialization, and how they are used in practice.

Weight Initialization: A Good Start Is Half the Battle

What is weight initialization? — Each layer of a neural network has weights $W$ and biases $b$. Before training these values are undefined, so we must choose what numbers to start from; this choice is weight initialization. Intuitively, it is like choosing where to start a marathon: too far back (weights too small) and progress is slow; too far forward (weights too large) and training can blow up and diverge.

Mathematically — A layer computes the linear combination $z = W\mathbf{x} + b$, where $\mathbf{x}$ is the input vector, $W$ the weight matrix, and $b$ the bias. If every element of $W$ is zero, all neurons in the layer produce the same output, the symmetry is never broken, and gradients cannot propagate useful signal during backprop. So we usually initialize with small random numbers, but the distribution (scale) of those numbers matters: the variance is chosen using the layer's input dimension $n_{in}$ and output dimension $n_{out}$ so that activations neither grow nor shrink too much as they pass through the network.

In practice — With bad initialization, a spam classifier may show almost no decrease in loss, or produce NaNs. In deep CNNs (e.g. medical imaging or fraud detection), skipping Xavier or He initialization often leads to vanishing gradients in the early layers, and training appears stuck; if the scale is too large, gradients explode and training becomes unstable. In practice, Xavier (for tanh/sigmoid) or He (for ReLU) initialization is the standard choice.
Weight initialization is the first step of training: set $W$ and $b$ for each layer at an appropriate scale so that variance is preserved during forward and backward propagation.
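To make the symmetry argument concrete, here is a minimal NumPy sketch (the layer sizes and data are made up for illustration): with all-zero weights, every hidden neuron produces the same output and every gradient is zero, so nothing ever updates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))        # 8 samples, 4 input features (made-up data)
y = rng.normal(size=(8, 1))        # made-up regression targets

W1 = np.zeros((4, 3))              # zero-initialized hidden layer (4 -> 3)
W2 = np.zeros((3, 1))              # zero-initialized output layer (3 -> 1)

h = np.tanh(x @ W1)                # hidden activations: identical (all zero) for every neuron
y_hat = h @ W2
grad_y = 2 * (y_hat - y) / len(x)  # dLoss/dy_hat for mean squared error

grad_W2 = h.T @ grad_y                               # all zeros -> W2 never moves
grad_W1 = x.T @ ((grad_y @ W2.T) * (1 - h**2))       # all zeros -> W1 never moves either

print(np.abs(grad_W1).max(), np.abs(grad_W2).max())  # both 0.0: training is stuck
```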

Effect of initialization scale

[Figure: activation scale across Input → Layer 1 → Layer 2 → Layer 3 → Output in the vanishing, stable, and exploding regimes]
Good initialization sets the scale of $W$, $b$ so that variance is preserved across layers.
A good start is half the battle: proper initialization → fast convergence · stable training.
  • ① Initialize: set $W$, $b$ for each layer (e.g. Xavier/He; see the sketch after this list)
  • ② Forward: input → linear sum $z$ → activation $a$ → next layer
  • ③ Compute the loss, then backprop: gradients pass back through the layers
  • ④ Update: adjust $W$, $b$ from the gradients. Good initialization keeps the gradient scale stable
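Putting steps ①–④ together, a minimal PyTorch sketch might look like the following; the two-layer network, layer sizes, and random data are illustrative assumptions, not part of the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# ① Initialize: set W, b for each layer (He for the ReLU layer, Xavier for the output layer)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
nn.init.kaiming_normal_(model[0].weight, nonlinearity="relu")  # He initialization
nn.init.xavier_uniform_(model[2].weight)                       # Xavier initialization
nn.init.zeros_(model[0].bias)
nn.init.zeros_(model[2].bias)

x = torch.randn(128, 20)   # made-up input batch
y = torch.randn(128, 1)    # made-up targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(100):
    y_hat = model(x)            # ② Forward: linear sum z -> activation a -> next layer
    loss = loss_fn(y_hat, y)    # ③ Compute the loss ...
    optimizer.zero_grad()
    loss.backward()             # ③ ... then backprop: gradients flow back through the layers
    optimizer.step()            # ④ Update: adjust W, b from the gradients
```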

Why it matters

Vanishing and exploding gradients — In deeper networks, backpropagated gradients are products of many factors (chain rule). If the weights are too small, this product tends to zero (vanishing gradient) and the early layers barely update; if they are too large, it blows up (exploding gradient) and you get NaN or Inf. Good initialization keeps the variance stable across layers so that gradients stay at a reasonable scale even in deep networks.

Convergence speed — With proper initialization you start from a good point on the loss surface. A bad starting point can trap training in poor local minima or make convergence very slow. In practice, initialization is tuned together with the learning rate by monitoring validation loss.
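The scale effect is easy to see numerically. This small NumPy sketch (width, depth, and scales chosen arbitrarily for illustration) pushes a batch through a chain of linear layers and tracks how the activation scale shrinks, stays put, or blows up depending on the weight standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 30   # made-up width and depth

def final_std(weight_std):
    """Push a batch through `depth` linear layers with weights ~ N(0, weight_std^2)
    and return the standard deviation of the final activations."""
    a = rng.normal(size=(64, n))
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(n, n))
        a = a @ W          # pure product of matrices, as in the chain rule
    return a.std()

print("too small :", final_std(0.01))               # shrinks toward 0    -> vanishing
print("1/sqrt(n) :", final_std(1 / np.sqrt(n)))     # stays near 1        -> stable
print("too large :", final_std(0.1))                # grows exponentially -> exploding
```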

How it is used

Xavier (Glorot) initialization — So that the variance of $z$ does not depend on the layer size, $W$ is sampled from the uniform distribution $U\left(-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}\right)$ or from a normal distribution $\mathcal{N}(0,\ \sigma^2)$ with $\sigma^2 = 2/(n_{in}+n_{out})$. It suits symmetric activations such as tanh and sigmoid.

He initialization — ReLU zeros out negative inputs, so the output variance is roughly half the input variance. He initialization compensates by using $\sigma^2 = 2/n_{in}$. It is the default in modern CNNs and MLPs that use ReLU or Leaky ReLU.

Practical choice — Use He for ReLU-family activations and Xavier for tanh/sigmoid. Frameworks (PyTorch, TensorFlow) usually apply one of these by default depending on the layer type.
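The formulas map directly onto the framework initializers. As a sanity check, this short PyTorch sketch (layer sizes 512×256 are an arbitrary choice, large enough for stable statistics) compares the empirical weight standard deviation against $\sqrt{2/(n_{in}+n_{out})}$ and $\sqrt{2/n_{in}}$:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 512, 256            # arbitrary layer sizes for this check

w = torch.empty(n_out, n_in)      # PyTorch stores Linear weights as (n_out, n_in)

nn.init.xavier_normal_(w)         # Xavier: sigma^2 = 2 / (n_in + n_out)
print(w.std().item(), math.sqrt(2 / (n_in + n_out)))

nn.init.kaiming_normal_(w, nonlinearity="relu")   # He (fan_in mode): sigma^2 = 2 / n_in
print(w.std().item(), math.sqrt(2 / n_in))

nn.init.xavier_uniform_(w)        # Xavier uniform: bound is sqrt(6 / (n_in + n_out))
print(w.abs().max().item(), math.sqrt(6 / (n_in + n_out)))
```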

Summary

Weight initialization is the process of setting the initial values of each layer's weights and biases before training. Initializing everything to zero keeps the neurons symmetric and prevents proper learning; random values that are too large or too small lead to exploding or vanishing activations and gradients. Xavier and He initialization adjust the variance based on layer size and are widely used: Xavier for symmetric activations like tanh/sigmoid, He for the ReLU family. A good start reduces vanishing and exploding gradients and makes convergence faster and more stable.

Problem-solving guide

Summary — Weight initialization is the step of choosing initial values for each layer's $W$ and $b$ before training. Zero initialization breaks learning due to symmetry, so we usually use small random numbers and adjust the variance (scale). Xavier uses $\sigma^2 = 2/(n_{in}+n_{out})$ for tanh/sigmoid; He uses $\sigma^2 = 2/n_{in}$ for the ReLU family. Good initialization reduces vanishing/exploding gradients and speeds up convergence.
  • Weight init definition — Setting $W$, $b$ before training; zero init not recommended. → concept choice 1
  • Xavier — $\sigma^2 = 2/(n_{in}+n_{out})$, for tanh/sigmoid. → 1
  • He — $\sigma^2 = 2/n_{in}$, for the ReLU family. → 2
  • Vanishing gradient — Gradient approaches 0 when the weights are too small. → 1
  • Exploding gradient — Gradient blows up when the weights are too large. → 2
  • Xavier uniform — Range $[-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}]$; use an integer when computing.
  • Layer size — $n_{in}=4$, $n_{out}=6$ → $n_{in}+n_{out}=10$ (checked in the snippet below).
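The arithmetic in the calculation-style rows can be verified with a few lines of Python, using $n_{in}=4$, $n_{out}=6$ from the layer-size row:

```python
import math

n_in, n_out = 4, 6                       # values from the layer-size row

print(n_in + n_out)                      # 10
print(math.sqrt(6 / (n_in + n_out)))     # Xavier uniform bound: sqrt(0.6) ≈ 0.775
print(2 / (n_in + n_out))                # Xavier variance sigma^2 = 0.2
print(2 / n_in)                          # He variance sigma^2 = 0.5
```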
Example (definition)
"What is the main purpose of weight initialization?
① Match layer scale before training
② Increase learning rate
③ Data augmentation"
Purpose is to keep activation and gradient scale stable across layers. → Answer 1

Example (Xavier vs He)
"Common initialization for layers using ReLU?
① Xavier
② He
③ Zero"
He is used for ReLU-family. → Answer 2

Example (calculation)
When $n_{in}=4$ and $n_{out}=6$, what is $n_{in}+n_{out}$ (integer) in Xavier?
$4+6=10$. → Answer 10

Example (true/false)
"Weight initialization is the process of setting $W$, $b$ before training."
True. → Answer 1

Example (scenario)
"When the loss barely decreases in a spam classifier, what should you check first?
① Initialization and learning rate
② Data size only
③ Batch size only"
Check initialization and the learning rate first. → Answer 1

Example (choice)
"In He initialization, $\sigma^2$ is?
① $2/n_{in}$
② $2/(n_{in}+n_{out})$
③ $1/n_{in}$"
He uses $\sigma^2 = 2/n_{in}$. → Answer 1

Example (concept)
"In Xavier, if $n_{in}+n_{out}=6$, the value (integer) of $6/(n_{in}+n_{out})$ is?
① 1
② 2
③ 3"
$6/6 = 1$. → Answer 1