[Abstract & Introduction] 3-line summary + problem framing
3-line summary
- Long-context LLMs keep past Key and Value embeddings in memory, so the KV cache becomes the real VRAM bottleneck.
- Older approach: even with very low-bit codes, each block of values still needs metadata (e.g. an FP16 scale and zero-point) to map the short codes back to the original range. That metadata is stored in a more precise format, so real VRAM savings are smaller than the bit-width suggests.
- PolarQuant applies random preconditioning, moves to polar coordinates, and stores angles compactly. With less reliance on that heavy per-block bookkeeping, it still reaches >4.2× KV-cache compression with strong long-context quality.
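The metadata overhead in the second bullet is easy to quantify. A small sketch (the block size and FP16 scale/zero-point layout are illustrative assumptions, not taken from the paper):

```python
# Hypothetical illustration: per-block metadata inflates the effective bit-width.
def effective_bits(bits_per_value, block_size, meta_bits=32):
    """Bits actually spent per stored value.

    meta_bits: per-block overhead, e.g. an FP16 scale + FP16 zero-point = 32 bits.
    """
    return bits_per_value + meta_bits / block_size

print(effective_bits(4, 32))  # 4-bit codes, blocks of 32 -> 5.0 bits/value
print(effective_bits(2, 32))  # 2-bit codes -> 3.0 bits/value: metadata is half the cost
```

The smaller the codes, the larger the relative share of the metadata, which is exactly the dilemma PolarQuant targets.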
Intuition by analogy: boxes with labels vs a compass warehouse
Conventional quantization is like shrinking the boxes in a warehouse but still attaching a heavy label to each box that records its range. PolarQuant instead mixes the content first, keeps one radius value, and then records mostly directional information. The storage becomes lighter because the expensive labels disappear.
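The "one radius plus directional information" idea can be made concrete with a toy recursive polar decomposition. This is a sketch of the general technique, not the paper's exact algorithm: a length-2^k vector is stored as a single radius plus a tree of angles, which can then be quantized to a few bits each.

```python
import numpy as np

# Toy sketch (not the paper's exact scheme): a length-2^k vector becomes
# one radius plus 2^k - 1 angles describing how the mass tilts at each split.

def encode(v):
    v = np.asarray(v, dtype=float)
    return np.linalg.norm(v), _angles(v)

def _angles(v):
    if len(v) == 2:
        return [np.arctan2(v[1], v[0])]  # leaf: signed angle in (-pi, pi]
    h = len(v) // 2
    left, right = v[:h], v[h:]
    # how the mass tilts between the two halves, an angle in [0, pi/2]
    tilt = np.arctan2(np.linalg.norm(right), np.linalg.norm(left))
    return [tilt] + _angles(left) + _angles(right)

def decode(r, angles, n):
    vec, rest = _build(r, angles, n)
    assert not rest
    return np.array(vec)

def _build(r, angles, n):
    theta, rest = angles[0], angles[1:]
    if n == 2:
        return [r * np.cos(theta), r * np.sin(theta)], rest
    left, rest = _build(r * np.cos(theta), rest, n // 2)
    right, rest = _build(r * np.sin(theta), rest, n // 2)
    return left + right, rest
```

The round trip is lossless before quantization; the storage win comes from the angles living in bounded, predictable ranges, unlike raw coordinates.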
In plain words
- The dilemma: shrinking the numbers is not enough; each block still needs a heavy FP16 "manual" (scale, zero-point, ...), and the packaging can outweigh the payload.
- What PolarQuant stores: swap axis values for one radius (overall size) plus angles (direction)—“which way does the mass tilt?”.
- Mixer and angle: random mixing before polar encoding balances the left vs right halves, so the tilt angle clusters near π/4 (equal halves mean a 45° tilt). A narrow, predictable band means a few bits suffice to quantize angles.
- Reported wins: >4.2× KV-cache compression, strong needle-retrieval results at ~104K context, and much faster prefill with offline codebooks (~3.4s vs ~11.6s in one paper setting).
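The angle-concentration claim in the bullets above can be checked numerically. A hedged demo, with illustrative dimensions and a deliberately skewed input whose mass sits entirely in the left half (so its tilt angle would be 0 without mixing):

```python
import numpy as np

# Demo: after a random rotation, the tilt angle between the two halves of a
# vector concentrates near pi/4, so a narrow band of angle codes suffices.
rng = np.random.default_rng(0)
d = 128

def tilt(v):
    # angle between the norms of the two halves, in [0, pi/2]
    return np.arctan2(np.linalg.norm(v[d // 2:]), np.linalg.norm(v[:d // 2]))

thetas = []
for _ in range(200):
    x = np.zeros(d)
    x[:4] = rng.standard_normal(4) * 10.0  # skewed: all mass in the left half
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    q *= np.sign(np.diag(r))  # sign fix -> uniformly random orthogonal matrix
    thetas.append(tilt(q @ x))

thetas = np.array(thetas)
print(f"mean={thetas.mean():.3f}  std={thetas.std():.3f}  (pi/4={np.pi/4:.3f})")
```

Without the rotation, `tilt(x)` is exactly 0 for these skewed vectors; after mixing, the angles land in a tight band around π/4 regardless of the input's shape, which is what makes few-bit angle codes viable.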
One line: exploit predictable angle concentration to drop most of that extra reconstruction overhead.