Learn / AI Papers / NLP & LLMs / CPAL 2026 / The Curse of Depth in Large Language Models

The Curse of Depth in Large Language Models

This review explains why simply adding more layers does not always buy more representation power in large language models. The paper analyzes variance accumulation in Pre-LN transformers and shows that a single depth-aware rule, LayerNorm Scaling (LNS), can keep deep layers useful instead of letting them collapse into identity-like behavior.
View original paper (PDF)

[Abstract & Introduction] 3-line summary + problem setup

- Core problem: Many deep layers in large LLMs contribute less than expected and can drift toward identity-like behavior.
- Classical limitation: Pre-LN improves optimization stability, but variance can still accumulate with depth.
- Key fix: LNS multiplies the normalized signal by $\frac{1}{\sqrt{l}}$, suppressing deep-layer variance growth and restoring useful layer participation.
Analogy: Imagine a stadium audio chain with 100 amplifiers in series. Without careful control, tiny noise added at each stage eventually drowns out the original voice. LNS acts like a smart limiter that lowers the volume more aggressively in later amplifiers so the original signal survives all the way to the end.

[Background] Essential concepts (expanded)

As you read, each concept below follows the same pattern: definition → why depth makes it tricky → link to this paper.
- Residual connections
Blocks look like $h_{l+1} = h_l + F(h_l)$: a skip path carries $h_l$ forward while $F$ proposes an update. Depth helps gradients and representation building, but also creates an additive channel where small fluctuations from each $F$ can pile up across layers; stability and variance growth are two sides of the same coin. This paper's LNS targets that pile-up with a depth-tied scale so late layers keep transforming instead of pass-through copying.
- Pre-LN vs Post-LN
Pre-LN: $h_{l+1} = h_l + F(\mathrm{LN}(h_l))$, normalizing before the sublayer; dominant in large LLMs for early-training stability. Post-LN: $h_{l+1} = \mathrm{LN}(h_l + F(h_l))$, normalizing after the residual add; can be powerful but often harder to optimize from scratch. The paper works inside the Pre-LN regime and studies depth-wise under-transformation (the "curse of depth"), then fixes it with a single scaling line.
- Variance explosion
Activations can spread as depth grows, so tiny input differences balloon late in the stack. Residual addition repeatedly adds noise-like components, so without control the effective scale drifts up; optimizers then favor mappings that change little (near-identity). The $\frac{1}{\sqrt{l}}$ factor can be read as forcing effective variance down like $\frac{1}{l}$ relative to the normalized branch.
- Identity collapse
If $J_l = \partial h^{(l)}_{\mathrm{out}} / \partial h^{(l)}_{\mathrm{in}} \approx I$, the layer is locally copying its input: compute is spent but few new features are created, the classic symptom of wasted depth. LNS is argued to delay that collapse and recover usable depth.
- Depth utilization
Measures whether extra layers/FLOPs actually buy better loss or downstream scores. If not, depth is mostly overhead. The paper reframes the contribution as making depth work to the end, not merely stacking more blocks.
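The variance pile-up described above can be illustrated with a toy simulation: a hypothetical 1-D residual stack where each layer adds a unit-variance random update, with and without the depth damping factor. This is a sketch for intuition only, not the paper's actual model.

```python
import math
import random

def simulate_variance(num_layers, use_lns, trials=2000):
    """Toy 1-D residual stack: each layer adds a unit-variance random
    update, optionally damped by the depth factor 1/sqrt(l)."""
    random.seed(0)  # fixed seed so the numbers are reproducible
    totals = []
    for _ in range(trials):
        h = 0.0
        for l in range(1, num_layers + 1):
            update = random.gauss(0.0, 1.0)  # stand-in for F(LN(h))
            scale = 1.0 / math.sqrt(l) if use_lns else 1.0
            h += scale * update
        totals.append(h)
    mean = sum(totals) / trials
    return sum((x - mean) ** 2 for x in totals) / trials

plain = simulate_variance(32, use_lns=False)   # grows roughly linearly with depth (~32)
scaled = simulate_variance(32, use_lns=True)   # grows like the harmonic sum (~4)
```

With 32 layers, the undamped variance tracks the layer count, while the damped version tracks the much slower harmonic sum $\sum_l \frac{1}{l}$, which is exactly the effect the $\frac{1}{\sqrt{l}}$ factor is meant to buy.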

[Proposed method] Core idea and main equation

The central formula is strikingly simple:
$\tilde{h}^{(l)} = \mathrm{LN}(h^{(l)}) \cdot \frac{1}{\sqrt{l}}$
The message is: the deeper the layer, the more carefully we control its output amplitude.
A helpful variance-side reading is
$\mathrm{Var}\!\left[\tilde{h}^{(l)}\right] \approx \frac{1}{l}\,\mathrm{Var}\!\left[\mathrm{LN}(h^{(l)})\right]$
so the effective signal scale is progressively damped as depth increases.
Symbols & how to read them
Below is a compact pass over every symbol used in the formulas above.
- $l$: current layer index.
- $L$: total number of layers.
- $h^{(l)} \in \mathbb{R}^{d}$: activation entering LayerNorm at layer $l$.
- $d$: hidden dimension.
- $\mathrm{LN}(h^{(l)})$: normalized signal before depth scaling.
- $\tilde{h}^{(l)}$: output after LayerNorm Scaling.
- $\frac{1}{\sqrt{l}}$: depth-aware damping factor that becomes stronger in deeper layers.
- Residual/attention/FFN expand representation power; LNS keeps that expansion numerically under control.
- The point is not to weaken deep layers, but to keep them stable enough to keep learning.
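The equation is small enough to sketch directly. Below is a minimal pure-Python illustration, assuming a plain LayerNorm with no learnable gain or bias (a simplification; real transformer blocks apply this inside each sublayer):

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over a list of floats (no learnable gain/bias)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def lns(x, layer_index):
    """LayerNorm Scaling: normalize, then damp by 1/sqrt(l) (l is 1-indexed)."""
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x)]

h = [2.0, -1.0, 0.5, 3.5]
shallow = lns(h, 1)   # layer 1: full-strength normalized signal
deep = lns(h, 8)      # layer 8: same direction, amplitude damped by 1/sqrt(8)
```

The damping changes only the amplitude, never the direction, of the normalized signal, so deep layers still receive a fully informative input, just a numerically tamer one.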

[Step-by-step intuition]

- Step 1: $\mathrm{LN}(h^{(l)})$ standardizes the signal and aligns its scale.
- Step 2: $\frac{1}{\sqrt{l}}$ is a depth-dependent brake that becomes stronger in later layers.
- Step 3: Residual, attention, and FFN act like the accelerator; LNS acts like the brake.
- Step 4: Their combination keeps deep layers expressive without letting variance run out of control.
In short, LNS is not about killing deep layers. It is about keeping them numerically healthy enough to keep learning.

[Toy walkthrough] How the formula behaves in motion

Consider a 6-layer transformer where residual additions gradually increase activation amplitude.
1. At $l=1$, the scale is $1.0$, so almost the full signal is passed through.
2. At $l=2$, the scale becomes about $0.707$, slightly damping the rising amplitude.
3. At $l=3$, the scale is about $0.577$, further suppressing accumulated noise.
4. At $l=4$, the scale reaches $0.5$, making later-layer growth much more controlled.
5. At $l=5$ and $l=6$, the scale becomes even smaller (about $0.447$ and $0.408$), preventing deep-layer blow-up while preserving meaningful transformations.
The intuition is simple: early layers keep enough freedom to build features, while later layers are prevented from turning into unstable amplifiers.
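The schedule above is just $\frac{1}{\sqrt{l}}$ evaluated at each layer; a one-liner reproduces the rounded numbers:

```python
import math

# Depth-aware damping schedule for a 6-layer stack.
scales = [1.0 / math.sqrt(l) for l in range(1, 7)]
print([round(s, 3) for s in scales])  # [1.0, 0.707, 0.577, 0.5, 0.447, 0.408]
```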

[Experiments and results]

The paper reports that LNS improves convergence behavior from smaller models up to multi-billion-parameter scale.
- It is hyperparameter-free in the sense that the rule is fixed by depth.
- It lowers final loss in large-scale experiments compared with vanilla Pre-LN.
- It preserves more angular diversity across deep-layer representations, suggesting that late layers remain meaningfully distinct instead of collapsing toward similar states.
From an engineering perspective, this is attractive because the implementation cost is tiny while the potential payoff on depth efficiency is large.
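One way to probe the angular-diversity claim is a cosine-similarity check between a layer's input and output: a near-identity layer leaves the direction almost unchanged, so similarity pins near 1. The vectors below are invented for illustration, not data from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

h_in = [1.0, 2.0, 3.0]
h_near_identity = [1.01, 2.0, 2.99]   # layer barely changed its input
h_transformed = [2.5, -1.0, 0.5]      # layer produced genuinely new features

print(cosine(h_in, h_near_identity) > 0.99)  # near-identity: similarity close to 1
print(cosine(h_in, h_transformed))           # distinct representation: much lower
```

When this similarity stays pinned near 1 across many consecutive layers, those layers are mostly copying; the paper's angular-diversity result says LNS keeps deep-layer representations meaningfully distinct instead.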

[Conclusion and limitations]

- Practical value 1: Better depth utilization creates a stronger starting point for pruning, quantization, and efficiency work.
- Practical value 2: More useful deep features can help downstream fine-tuning and task adaptation.
- Practical value 3: The method is easy to insert into existing Pre-LN pipelines without architectural surgery.
Limitations: The analysis mainly targets Pre-LN transformers. Generalization to Post-LN, normalization-free models, and multimodal branches remains an open direction.

Visualization plan: uncontrolled amplification vs controlled depth scaling

The left panel shows variance growing with depth in legacy Pre-LN, while the right panel shows how LNS keeps amplitude under control as layers get deeper.

Legacy Pre-LN

Variance keeps building up, so late layers drift toward identity-like behavior.

[Chart: layer contribution from Layer 1 to Layer L; variance grows until late layers become near-identity]

Proposed LNS

Depth-aware damping stabilizes amplitude and keeps deep layers useful.

[Chart: layer contribution from Layer 1 to Layer L; controlled amplitude keeps deep layers useful]
The appeal of LNS is that it attacks the curse of depth with almost no architectural overhead. The paper turns "more depth" from a fragile scaling strategy into something much closer to usable learning capacity.
