The Curse of Depth in Large Language Models

This review explains why simply adding more layers does not always buy more representation power in large language models. The paper analyzes variance accumulation in Pre-LN transformers and shows that a single depth-aware rule, LayerNorm Scaling (LNS), can keep deep layers useful instead of letting them collapse into identity-like behavior.
### [Abstract & Introduction] 3-line summary + problem setup
- Core problem: Many deep layers in large LLMs contribute less than expected and can drift toward identity-like behavior.
- Classical limitation: Pre-LN improves optimization stability, but variance can still accumulate with depth.
- Key fix: LNS multiplies the normalized signal by $\frac{1}{\sqrt{l}}$, suppressing deep-layer variance growth and restoring useful layer participation.
Analogy: Imagine a stadium audio chain with 100 amplifiers in series. Without careful control, tiny noise added at each stage eventually drowns out the original voice. LNS acts like a smart limiter that lowers the volume more aggressively in later amplifiers so the original signal survives all the way to the end.
### [Background] What you need before the math
- Residual connections: Helpful for information flow, but they also create an additive path where variance can keep accumulating.
- Pre-LN vs Post-LN: Modern LLMs prefer Pre-LN for training stability, but deep stacks can still become under-transformative.
- Variance explosion: Later layers can see overly large effective scale because fluctuations pile up through depth.
- Identity collapse: A deep layer spends compute yet behaves almost like a pass-through map.
- Depth utilization: The real question is whether extra depth actually buys extra learning capacity.
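The variance-explosion bullet above can be made concrete with a minimal numpy sketch. This is an illustration only: the hidden size, depth, and the assumption that each layer adds an independent unit-variance update are my simplifications, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 512, 32  # hypothetical hidden size and depth

# Residual stream: model each sublayer as adding independent unit-variance noise.
x = rng.standard_normal(d)
for l in range(1, L + 1):
    x = x + rng.standard_normal(d)  # f_l(x) modeled as a unit-variance update
    if l in (1, 8, 32):
        print(f"layer {l:2d}: variance = {x.var():.1f}")
```

Under this model the residual-stream variance grows roughly linearly with depth (about $l + 1$ after $l$ layers), which is the accumulation the later sections set out to tame.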
### [Proposed method] Core idea and main equation
The central formula is strikingly simple:
$$\tilde{h}^{(l)} = \mathrm{LN}\left(h^{(l)}\right) \cdot \frac{1}{\sqrt{l}}$$
The message is: the deeper the layer, the more carefully we control its output amplitude.
A helpful variance-side reading is
$$\mathrm{Var}\left[\tilde{h}^{(l)}\right] \approx \frac{1}{l}\,\mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]$$
so the effective signal scale is progressively damped as depth increases.
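The variance-side reading follows from the basic identity $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$ with $c = \frac{1}{\sqrt{l}}$; treating the normalized signal componentwise, the $\frac{1}{l}$ factor drops out directly:

```latex
\mathrm{Var}\left[\tilde{h}^{(l)}\right]
  = \mathrm{Var}\left[\tfrac{1}{\sqrt{l}}\,\mathrm{LN}\left(h^{(l)}\right)\right]
  = \left(\tfrac{1}{\sqrt{l}}\right)^{2} \mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]
  = \tfrac{1}{l}\,\mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]
```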

Symbol breakdown

- $l$: current layer index.
- $L$: total number of layers.
- $h^{(l)} \in \mathbb{R}^{d}$: activation entering LayerNorm at layer $l$.
- $d$: hidden dimension.
- $\mathrm{LN}(h^{(l)})$: normalized signal before depth scaling.
- $\tilde{h}^{(l)}$: output after LayerNorm Scaling.
- $\frac{1}{\sqrt{l}}$: depth-aware damping factor that becomes stronger in deeper layers.
Residual/attention/FFN expand representation power; LNS keeps that expansion numerically under control.
The point is not to weaken deep layers, but to keep them stable enough to keep learning.
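A minimal numpy sketch of the rule, under stated assumptions: the names `layer_norm` and `lns` are mine, learnable gain/bias are omitted, and a real implementation would apply this inside each transformer block rather than standalone.

```python
import numpy as np

def layer_norm(h: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Plain LayerNorm over the last axis (no learnable gain/bias)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def lns(h: np.ndarray, l: int) -> np.ndarray:
    """LayerNorm Scaling: normalize, then damp by 1/sqrt(layer index)."""
    return layer_norm(h) / np.sqrt(l)

h = np.random.default_rng(0).standard_normal((4, 512))
print(lns(h, 9).var())  # ≈ 1/9, since LN output has ~unit variance
```

The implementation cost is literally one extra division per block, which matches the paper's framing of LNS as a near-free change.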
### [Step-by-step intuition]
- Step 1: $\mathrm{LN}(h^{(l)})$ standardizes the signal and aligns its scale.
- Step 2: $\frac{1}{\sqrt{l}}$ is a depth-dependent brake that becomes stronger in later layers.
- Step 3: Residual, attention, and FFN act like the accelerator; LNS acts like the brake.
- Step 4: Their combination keeps deep layers expressive without letting variance run out of control.
In short, LNS is not about killing deep layers. It is about keeping them numerically healthy enough to keep learning.
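The accelerator/brake picture can be simulated with a toy residual stream. This is a sketch under my own assumptions (each sublayer contributes a unit-variance update that inherits the $\frac{1}{\sqrt{l}}$ damping), not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 512, 64  # hypothetical hidden size and depth

def run(depth_scaled: bool) -> list[float]:
    """Track residual-stream variance; each layer adds a (possibly damped) update."""
    x = rng.standard_normal(d)
    history = []
    for l in range(1, L + 1):
        update = rng.standard_normal(d)          # stand-in for attention/FFN output
        scale = 1 / np.sqrt(l) if depth_scaled else 1.0
        x = x + scale * update                   # residual addition
        history.append(x.var())
    return history

vanilla = run(depth_scaled=False)
scaled = run(depth_scaled=True)
print(f"final variance, vanilla Pre-LN: {vanilla[-1]:.1f}")  # grows ~linearly in depth
print(f"final variance, with damping:   {scaled[-1]:.1f}")   # grows ~log(depth)
```

In this toy model the undamped stream's variance grows like $L$, while the damped stream's grows like the harmonic sum $\sum_l \frac{1}{l} \approx \ln L$, which is the "brake" keeping deep layers in a healthy numerical range.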
### [Toy walkthrough] How the formula behaves in motion
Consider a 6-layer transformer where residual additions gradually increase activation amplitude.
1. At $l=1$, the scale is $1.0$, so almost the full signal is passed through.
2. At $l=2$, the scale becomes about $0.707$, slightly damping the rising amplitude.
3. At $l=3$, the scale is about $0.577$, further suppressing accumulated noise.
4. At $l=4$, the scale reaches $0.5$, making later-layer growth much more controlled.
5. At $l=5$ and $l=6$, the scale becomes even smaller, preventing deep-layer blow-up while preserving meaningful transformations.
The intuition is simple: early layers keep enough freedom to build features, while later layers are prevented from turning into unstable amplifiers.
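The six scale values in the walkthrough are just $\frac{1}{\sqrt{l}}$ evaluated directly, a quick check of the rounding used above:

```python
import math

# Depth-aware scale factors for a 6-layer stack.
scales = [1 / math.sqrt(l) for l in range(1, 7)]
for l, s in enumerate(scales, start=1):
    print(f"layer {l}: scale = {s:.3f}")
# layer 1: 1.000, layer 2: 0.707, layer 3: 0.577,
# layer 4: 0.500, layer 5: 0.447, layer 6: 0.408
```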
### [Experiments and results]
The paper reports that LNS improves convergence behavior from smaller models up to multi-billion-parameter scale.
- It is hyperparameter-free in the sense that the rule is fixed by depth.
- It lowers final loss in large-scale experiments compared with vanilla Pre-LN.
- It preserves more angular diversity across deep-layer representations, suggesting that late layers remain meaningfully distinct instead of collapsing toward similar states.
From an engineering perspective, this is attractive because the implementation cost is tiny while the potential payoff on depth efficiency is large.
### [Conclusion and limitations]
- Practical value 1: Better depth utilization creates a stronger starting point for pruning, quantization, and efficiency work.
- Practical value 2: More useful deep features can help downstream fine-tuning and task adaptation.
- Practical value 3: The method is easy to insert into existing Pre-LN pipelines without architectural surgery.
Limitations: The analysis mainly targets Pre-LN transformers. Generalization to Post-LN, normalization-free models, and multimodal branches remains an open direction.

Visualization: uncontrolled amplification vs controlled depth scaling

Figure (two panels, layer contribution plotted from Layer 1 to Layer $L$). Left, legacy Pre-LN: variance keeps building with depth, so late layers drift toward near-identity behavior. Right, proposed LNS: depth-aware damping keeps amplitude controlled, so deep layers stay useful.
The appeal of LNS is that it attacks the curse of depth with almost no architectural overhead. The paper turns "more depth" from a fragile scaling strategy into something much closer to usable learning capacity.