### [Abstract & Introduction] 3-line summary + problem setup
- Core problem: Many deep layers in large LLMs contribute less than expected and can drift toward identity-like behavior.
- Classical limitation: Pre-LN improves optimization stability, but variance can still accumulate with depth.
- Key fix: LNS (LayerNorm Scaling) multiplies the normalized signal at layer ℓ by 1/√ℓ, suppressing deep-layer variance growth and restoring useful layer participation.
Analogy: Imagine a stadium audio chain with 100 amplifiers in series. Without careful control, tiny noise added at each stage eventually drowns out the original voice. LNS acts like a smart limiter that lowers the volume more aggressively in later amplifiers so the original signal survives all the way to the end.
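The variance-suppression argument above can be checked with a minimal accounting sketch. Assuming the residual update x_{ℓ+1} = x_ℓ + s(ℓ)·LN(·), where the LayerNorm output has unit variance and is roughly independent across layers, variances add: plain Pre-LN (s = 1) gives linear growth in depth, while a 1/√ℓ scale turns the sum into a harmonic series that grows only logarithmically. The function and names below are illustrative, not from the paper's code:

```python
import math

def residual_variance(num_layers, scale_fn):
    # Residual stream starts at unit variance; each layer adds a
    # sub-layer output with unit variance (post-LayerNorm), scaled
    # by a depth-dependent factor s(l). Independent terms add variances:
    # var(x_{l+1}) = var(x_l) + s(l)^2 * 1.
    var = 1.0
    for layer in range(1, num_layers + 1):
        s = scale_fn(layer)
        var += s * s
    return var

# Plain Pre-LN: s(l) = 1 -> variance grows linearly with depth.
pre_ln = residual_variance(100, lambda l: 1.0)
# LNS-style scaling: s(l) = 1/sqrt(l) -> variance grows like log(depth).
lns = residual_variance(100, lambda l: 1.0 / math.sqrt(l))

print(round(pre_ln, 1))  # -> 101.0
print(round(lns, 1))     # -> 6.2 (1 + harmonic number H_100)
```

Over 100 layers the unscaled stream's variance is about 16x larger, which matches the amplifier-chain intuition: later stages must be attenuated harder for the original signal to survive.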