The Curse of Depth in Large Language Models

This review explains why simply adding more layers does not always buy more representation power in large language models. The paper analyzes variance accumulation in Pre-LN transformers and shows that a single depth-aware rule, LayerNorm Scaling (LNS), can keep deep layers useful instead of letting them collapse into identity-like behavior.
### [Abstract & Introduction] 3-line summary + problem setup
- Core problem: Many deep layers in large LLMs contribute less than expected and can drift toward identity-like behavior.
- Classical limitation: Pre-LN improves optimization stability, but variance can still accumulate with depth.
- Key fix: LNS multiplies the normalized signal by $\frac{1}{\sqrt{l}}$, suppressing deep-layer variance growth and restoring useful layer participation.
Analogy: Imagine a stadium audio chain with 100 amplifiers in series. Without careful control, tiny noise added at each stage eventually drowns out the original voice. LNS acts like a smart limiter that lowers the volume more aggressively in later amplifiers so the original signal survives all the way to the end.
### [Background] What you need before the math
- Residual connections: Helpful for information flow, but they also create an additive path where variance can keep accumulating.
- Pre-LN vs Post-LN: Modern LLMs prefer Pre-LN for training stability, but deep stacks can still become under-transformative.
- Variance explosion: Later layers can see overly large effective scale because fluctuations pile up through depth.
- Identity collapse: A deep layer spends compute yet behaves almost like a pass-through map.
- Depth utilization: The real question is whether extra depth actually buys extra learning capacity.
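The variance-explosion bullet above can be made concrete with a minimal numpy sketch. This is an illustration only: the hidden size, depth, and the assumption that each layer adds an independent unit-variance update are my simplifications, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 512, 32  # hypothetical hidden size and depth

# Residual stream: model each sublayer as adding independent unit-variance noise.
x = rng.standard_normal(d)
for l in range(1, L + 1):
    x = x + rng.standard_normal(d)  # f_l(x) modeled as a unit-variance update
    if l in (1, 8, 32):
        print(f"layer {l:2d}: variance = {x.var():.1f}")
```

Under this model the residual-stream variance grows roughly linearly with depth (about $l + 1$ after $l$ layers), which is the accumulation the later sections set out to tame.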
### [Proposed method] Core idea and main equation
The central formula is strikingly simple:
$$\tilde{h}^{(l)} = \mathrm{LN}\left(h^{(l)}\right) \cdot \frac{1}{\sqrt{l}}$$
The message is: the deeper the layer, the more carefully we control its output amplitude.
A helpful variance-side reading is
$$\mathrm{Var}\left[\tilde{h}^{(l)}\right] \approx \frac{1}{l}\,\mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]$$
so the effective signal scale is progressively damped as depth increases.
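The variance-side reading follows from the basic identity $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$ with $c = \frac{1}{\sqrt{l}}$; treating the normalized signal componentwise, the $\frac{1}{l}$ factor drops out directly:

```latex
\mathrm{Var}\left[\tilde{h}^{(l)}\right]
  = \mathrm{Var}\left[\tfrac{1}{\sqrt{l}}\,\mathrm{LN}\left(h^{(l)}\right)\right]
  = \left(\tfrac{1}{\sqrt{l}}\right)^{2} \mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]
  = \tfrac{1}{l}\,\mathrm{Var}\left[\mathrm{LN}\left(h^{(l)}\right)\right]
```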

Symbol breakdown

- $l$: current layer index.
- $L$: total number of layers.
- $h^{(l)} \in \mathbb{R}^{d}$: activation entering LayerNorm at layer $l$.
- $d$: hidden dimension.
- $\mathrm{LN}(h^{(l)})$: normalized signal before depth scaling.
- $\tilde{h}^{(l)}$: output after LayerNorm Scaling.
- $\frac{1}{\sqrt{l}}$: depth-aware damping factor that becomes stronger in deeper layers.
Residual/attention/FFN expand representation power; LNS keeps that expansion numerically under control.
The point is not to weaken deep layers, but to keep them stable enough to keep learning.
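A minimal numpy sketch of the rule, under stated assumptions: the names `layer_norm` and `lns` are mine, learnable gain/bias are omitted, and a real implementation would apply this inside each transformer block rather than standalone.

```python
import numpy as np

def layer_norm(h: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Plain LayerNorm over the last axis (no learnable gain/bias)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def lns(h: np.ndarray, l: int) -> np.ndarray:
    """LayerNorm Scaling: normalize, then damp by 1/sqrt(layer index)."""
    return layer_norm(h) / np.sqrt(l)

h = np.random.default_rng(0).standard_normal((4, 512))
print(lns(h, 9).var())  # ≈ 1/9, since LN output has ~unit variance
```

The implementation cost is literally one extra division per block, which matches the paper's framing of LNS as a near-free change.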
### [Step-by-step intuition]
- Step 1: $\mathrm{LN}(h^{(l)})$ standardizes the signal and aligns its scale.
- Step 2: $\frac{1}{\sqrt{l}}$ is a depth-dependent brake that becomes stronger in later layers.
- Step 3: Residual, attention, and FFN act like the accelerator; LNS acts like the brake.
- Step 4: Their combination keeps deep layers expressive without letting variance run out of control.
In short, LNS is not about killing deep layers. It is about keeping them numerically healthy enough to keep learning.
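The accelerator/brake picture can be simulated with a toy residual stream. This is a sketch under my own assumptions (each sublayer contributes a unit-variance update that inherits the $\frac{1}{\sqrt{l}}$ damping), not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 512, 64  # hypothetical hidden size and depth

def run(depth_scaled: bool) -> list[float]:
    """Track residual-stream variance; each layer adds a (possibly damped) update."""
    x = rng.standard_normal(d)
    history = []
    for l in range(1, L + 1):
        update = rng.standard_normal(d)          # stand-in for attention/FFN output
        scale = 1 / np.sqrt(l) if depth_scaled else 1.0
        x = x + scale * update                   # residual addition
        history.append(x.var())
    return history

vanilla = run(depth_scaled=False)
scaled = run(depth_scaled=True)
print(f"final variance, vanilla Pre-LN: {vanilla[-1]:.1f}")  # grows ~linearly in depth
print(f"final variance, with damping:   {scaled[-1]:.1f}")   # grows ~log(depth)
```

In this toy model the undamped stream's variance grows like $L$, while the damped stream's grows like the harmonic sum $\sum_l \frac{1}{l} \approx \ln L$, which is the "brake" keeping deep layers in a healthy numerical range.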
### [Toy walkthrough] How the formula behaves in motion
Consider a 6-layer transformer where residual additions gradually increase activation amplitude.
1. At $l=1$, the scale is $1.0$, so almost the full signal is passed through.
2. At $l=2$, the scale becomes about $0.707$, slightly damping the rising amplitude.
3. At $l=3$, the scale is about $0.577$, further suppressing accumulated noise.
4. At $l=4$, the scale reaches $0.5$, making later-layer growth much more controlled.
5. At $l=5$ and $l=6$, the scale becomes even smaller, preventing deep-layer blow-up while preserving meaningful transformations.
The intuition is simple: early layers keep enough freedom to build features, while later layers are prevented from turning into unstable amplifiers.
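The six scale values in the walkthrough are just $\frac{1}{\sqrt{l}}$ evaluated directly, a quick check of the rounding used above:

```python
import math

# Depth-aware scale factors for a 6-layer stack.
scales = [1 / math.sqrt(l) for l in range(1, 7)]
for l, s in enumerate(scales, start=1):
    print(f"layer {l}: scale = {s:.3f}")
# layer 1: 1.000, layer 2: 0.707, layer 3: 0.577,
# layer 4: 0.500, layer 5: 0.447, layer 6: 0.408
```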
### [Experiments and results]
The paper reports that LNS improves convergence behavior from smaller models up to multi-billion-parameter scale.
- It is hyperparameter-free in the sense that the rule is fixed by depth.
- It lowers final loss in large-scale experiments compared with vanilla Pre-LN.
- It preserves more angular diversity across deep-layer representations, suggesting that late layers remain meaningfully distinct instead of collapsing toward similar states.
From an engineering perspective, this is attractive because the implementation cost is tiny while the potential payoff on depth efficiency is large.
### [Conclusion and limitations]
- Practical value 1: Better depth utilization creates a stronger starting point for pruning, quantization, and efficiency work.
- Practical value 2: More useful deep features can help downstream fine-tuning and task adaptation.
- Practical value 3: The method is easy to insert into existing Pre-LN pipelines without architectural surgery.
Limitations: The analysis mainly targets Pre-LN transformers. Generalization to Post-LN, normalization-free models, and multimodal branches remains an open direction.

Visualization: uncontrolled amplification vs controlled depth scaling

Figure (two panels, layer contribution plotted from Layer 1 to Layer $L$). Left, legacy Pre-LN: variance keeps building with depth, so late layers drift toward near-identity behavior. Right, proposed LNS: depth-aware damping keeps amplitude controlled, so deep layers stay useful.
The appeal of LNS is that it attacks the curse of depth with almost no architectural overhead. The paper turns "more depth" from a fragile scaling strategy into something much closer to usable learning capacity.