Chapter 1: PolarQuant: Quantizing KV Caches with Polar Transformation

In long-context LLM serving, the bottleneck is often not the model weights but the KV cache memory. PolarQuant attacks that bottleneck directly: after random preconditioning, it rewrites a KV vector in polar form and stores angles compactly, cutting the usual burden of extra “how to reconstruct the numbers” side information. This review unpacks the main formulas, why the angle distribution concentrates near $\pi/4$, and what that means for real systems.
[Abstract & Introduction] 3-line summary + problem framing
3-line summary
- Long-context LLMs keep past Key and Value embeddings in memory, so the KV cache becomes the real VRAM bottleneck.
- Older approach: even if you store numbers with very few bits, each chunk usually needs extra helper numbers that say how to map those short codes back to the original range. Those helpers are often stored in a more precise format (e.g. FP16), so VRAM savings don’t feel as big as you’d hope.
- PolarQuant applies random preconditioning, moves to polar coordinates, and stores angles compactly. With less reliance on that heavy side bookkeeping, it still reaches >4.2× KV-cache compression with strong long-context quality.
Intuition by analogy: boxes with labels vs a compass warehouse
Conventional quantization is like shrinking the boxes in a warehouse but still attaching a heavy label to each box that records its range. PolarQuant instead mixes the content first, keeps one radius value, and then records mostly directional information. The storage becomes lighter because the expensive labels disappear.
In plain words
- The dilemma: shrinking numbers is not enough—each block still needs a heavy FP16 “manual” (scale, zero-point, …). Packaging can outweigh the payload.
- What PolarQuant stores: swap axis values for one radius $r$ (overall size) plus angles (direction)—“which way does the mass tilt?”.
- Mixer $S$ and $45^\circ$: random mixing before polar encoding often balances left vs right halves; angles then cluster near $\pi/4$. A narrow, predictable band means few bits can quantize angles.
- Reported wins: >4.2× KV-cache compression, strong ~104K needle tests, and much faster prefill with offline codebooks (~3.4s vs ~11.6s in one paper setting).
One line: exploit predictable angle concentration to drop most of that extra reconstruction overhead.
[Background] Essential prerequisites
Read this first — before the bullet list
- Floating-point (FP32 vs FP16) Neural nets usually store activations as floating-point numbers. FP32 is 32-bit single precision; FP16 is half precision—about half the bits per number, so you can fit roughly twice as many values in the same memory, with a slightly coarser grid of representable values. When papers say the KV cache is kept in FP16, they mean the uncompressed baseline uses 16-bit floats.
- What quantization means You round many possible real values onto a small set of integer codes (e.g. 4 bits ⇒ $2^4 = 16$ levels). The usual goals are less memory and sometimes faster math; at inference time you dequantize back to something close to the original. You often need extra helper numbers (scale, zero-point, …) so each short code can be mapped back to the right numeric range.
- Why INT4 and FP16 appear in the same sentence The payload weights may be INT4, but per-block metadata that explains how to decode those codes is often stored in a wider, higher-precision format (e.g. FP16). That bookkeeping can eat a lot of the apparent bit savings—the sections below show how PolarQuant tries to shrink that layer.
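That amortized metadata cost is simple arithmetic. A minimal sketch (the block size of 32 and the two FP16 helpers per block are illustrative assumptions for the general block-quantization pattern, not values from the paper):

```python
def effective_bits(payload_bits=4, block=32, meta_numbers=2, meta_bits=16):
    """Effective bits per stored value under block quantization:
    the payload bits plus per-block metadata amortized over the block."""
    return payload_bits + meta_numbers * meta_bits / block

# INT4 payload + one FP16 scale and one FP16 zero-point per 32-value block:
print(effective_bits())            # 5.0 bits/value, i.e. 25% over pure INT4
print(effective_bits(block=128))   # 4.25: larger blocks amortize better
```

Shrinking that metadata layer, which is what PolarQuant targets, matters most at small block sizes, where the helpers dominate the payload.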

Each item below pairs a definition with what it means for PolarQuant (not just vocabulary).
- KV cache stores past Keys and Values so autoregressive decoding can avoid recomputation. As sequence length grows, this cache grows almost linearly and often dominates inference memory.
Meaning here: the vector $x$ in later formulas is best read as a cached activation, not a model weight. PolarQuant changes how that cache is represented, not the attention definition.
- Quantization overhead To squeeze values into very short integers, you typically need per-chunk helper numbers that set the mapping back to the real range (in papers/code these are often called scale, zero-point, etc.). Those helpers are often stored per block and in high precision, so they add hidden cost.
Meaning here: part of why “bits went down but VRAM didn’t” is those helper numbers tagging along. PolarQuant tries to shrink that layer by changing coordinates.
- Polar coordinates replace raw coordinates with one radius and several angles. PolarQuant extends this intuition to high dimensions through a recursive construction.
Meaning here: you stop arguing in raw axis units and instead store one overall scale (radius) + directional information (angles), which is what enables storing angles compactly without the old per-block scaling pattern.
- Random preconditioning mixes the coordinates so the transformed vector behaves like a Gaussian object instead of being aligned with a few unstable axes.
Meaning here: $S$ is not cosmetic—it is what makes the subsequent Gaussian-style analysis of angles legitimate enough to design codebooks.
- Concentration in high dimensions is the magic ingredient: after preconditioning, certain angles concentrate sharply, which makes them cheap to quantize.
Meaning here: "where the angle mass sits" directly becomes how many bits you need—tight mass means simpler codebooks.
Everyday hook: the KV cache is the warehouse for past tokens; quantization shrinks the numbers in each bin. But if every bin also needs a sticky note explaining how to decode it, the notes can outweigh the savings.
[Method] Core proposal — formula walkthrough (detailed)
PolarQuant asks: *after random mixing, what is the cheapest way to store a KV vector?*
How to read this section (roadmap)
You do not need to memorize every symbol. Keep the order of ideas and skip dense parts if needed.
1. Mix with $S$ — spread mass away from a few axes so later per-angle quantization is easier to justify.
2. Build angles in layers — level 1 is “direction of each neighbor pair”; higher levels compare larger left/right chunks with one angle each.
3. Joint density factorizes — length (radius) and angle blocks behave almost independently, so design splits.
4. Angles pile up near $\pi/4$ — when typical angles sit in a narrow band, few bits suffice.
5. Store angles compactly → reconstruct coordinates → feed standard attention — the attention formula itself is unchanged.
Light reading: for (4)–(5) you can ignore exponents and $\Gamma(\cdot)$; it is enough to remember “probability mass concentrates near $\pi/4$.”

1) Random preconditioning: why $Sx$ looks Gaussian
Intuition first: $Sx \sim \mathcal{N}(\cdots)$ is a modeling picture: coordinates of $Sx$ jiggle with similar strength and can be analyzed mostly separately. Real samples need not be perfectly Gaussian—this is the approximation that supports per-angle quantization.
Start from $x \in \mathbb{R}^{d}$ and multiply by a random sketch matrix $S \in \mathbb{R}^{m \times d}$:
$$Sx \sim \mathcal{N}(0, \|x\|_2^2 I_m)$$
One-sentence meaning: each coordinate of $Sx$ is an independent Gaussian with mean 0 and variance $\|x\|_2^2$.
Symbol breakdown
- $x$: original KV vector to quantize (one head, one token)
- $d$: original dimension
- $S$: random preconditioning matrix (drawn under the paper's assumptions)
- $m$: output dimension after sketching
- $I_m$: the $m \times m$ identity matrix
- $\|x\|_2$: Euclidean norm of $x$
- $\mathcal{N}(0, \sigma^2 I_m)$: multivariate normal with zero mean and isotropic covariance $\sigma^2 I_m$
What this distribution statement is *for*: it is not a claim that $Sx$ is "exactly always Gaussian in every finite setting", but a working model that coordinates are similarly scaled and roughly independent. That is what later allows per-angle quantization without modeling a huge cross-angle coupling term.
Step-by-step intuition
1. Variance scales with $\|x\|_2^2$: larger original vectors produce larger coordinate fluctuations after mixing.
2. Independence across coordinates: $I_m$ means coordinates are not coupled in the covariance—this is what makes later per-angle quantization cleaner.
3. Why this helps: mixing reduces axis-aligned outliers; storing radius + angles becomes more natural than per-coordinate scaling metadata.
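A quick simulation of the statement above, under the common assumption that $S$ has i.i.d. standard normal entries (an illustrative sketch, not necessarily the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 64, 8, 4000
x = rng.uniform(-1.0, 1.0, size=d)     # a fixed, arbitrary "KV-like" vector

# With S_ij iid N(0,1), each coordinate of Sx is a weighted sum of Gaussians,
# hence exactly N(0, ||x||_2^2), and coordinates are uncorrelated.
S = rng.normal(size=(trials, m, d))
Sx = S @ x                              # trials x m samples of the sketch

print(Sx.mean())                        # ~ 0
print(Sx.var(), np.linalg.norm(x)**2)   # empirical variance ~ ||x||_2^2
```

The empirical variance tracks $\|x\|_2^2$ regardless of how lopsided the original coordinates were, which is the whole point of the mixing step.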

2) Level-1 angles from adjacent pairs
Intuition first: Same spirit as the slope angle of $(a,b)$ via $\tan^{-1}(b/a)$. Length is not removed yet—only direction. Pairing coordinates yields $d/2$ small angles.
$$\psi_j^{(1)} := \tan^{-1}\!\left(\frac{x_{2j}}{x_{2j-1}}\right), \quad j = 1, \ldots, d/2$$
One-sentence meaning: for each adjacent pair $(x_{2j-1}, x_{2j})$, record the angle of the 2D vector in the plane.
Symbol breakdown: $\psi_j^{(1)}$ is the local angle at the bottom of the recursion; $j$ indexes pairs.
What $\psi_j^{(1)}$ is really capturing: it compresses a 2D pair into a direction first, before larger-scale "which half is bigger?" comparisons show up at higher levels.
Intuition: level 1 captures fine, local direction between neighboring coordinates.
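In code, level 1 is a single vectorized arctangent over adjacent pairs. A sketch on a small vector (the 3-4-5 values are chosen for clean numbers; `arctan2` stands in for $\tan^{-1}$ so negative coordinates keep their quadrant):

```python
import numpy as np

x = np.array([3.0, 4.0, 4.0, 3.0])            # toy vector, d = 4
pairs = x.reshape(-1, 2)                       # (x1, x2) and (x3, x4)
psi_1 = np.arctan2(pairs[:, 1], pairs[:, 0])   # d/2 level-1 angles
print(psi_1)   # ~ [0.9273, 0.6435] rad, i.e. about 53 and 37 degrees
```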

3) Level $\ell \ge 2$: angle from left/right block norms
Intuition first: Heavy index notation, but one idea: split a segment into front half vs back half, form (norm of back)/(norm of front), then apply $\tan^{-1}$. Ratio $\approx 1$ means balanced energy → angle near $\pi/4$. Higher $\ell$ compares larger chunks.
$$\psi_j^{(\ell)} := \tan^{-1}\!\left( \frac{\left\|x_{(j-1/2)2^{\ell}+1:\,j2^{\ell}}\right\|_2}{\left\|x_{(j-1)2^{\ell}+1:\,(j-1/2)2^{\ell}}\right\|_2} \right)$$
One-sentence meaning: compare the energy of the right half block vs the left half block inside a segment; take $\tan^{-1}$ of their ratio.
Key intuition: if the two halves have similar energy, the ratio is ~1 and the angle lands near $\pi/4$.
What the ratio is really measuring: numerator/denominator compares how much vector energy sits in the right chunk vs the left chunk. Near 1 means balanced halves, which maps to an angle near $\pi/4$ under $\tan^{-1}$.
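The same idea in code for a single level-2 angle (a sketch on the toy 3-4-5 values; balanced halves land exactly on $\pi/4$):

```python
import numpy as np

x = np.array([3.0, 4.0, 4.0, 3.0])
left, right = x[:2], x[2:]                            # front half vs back half
ratio = np.linalg.norm(right) / np.linalg.norm(left)  # 5 / 5 = 1
psi_2 = np.arctan(ratio)
print(psi_2, np.pi / 4)   # equal energy in both halves -> angle is pi/4
```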

4) Joint density factorization
Intuition first: “Joint density = product” ≈ independence: you can think about how to quantize radius and how to quantize each angle level mostly separately, instead of one giant coupled quantizer.
$$f_{R,\Psi_d}(r, \psi_d(x)) = f_R(r) \prod_{\ell=1}^{\log_2 d} f_{\Psi^{(\ell)}}\!\left(\psi^{(\ell)}\right)$$
One-sentence meaning: the radius and the angle blocks are independent in the joint distribution—multiply the marginals.
Why it matters: you can quantize angles independently in 1D instead of solving one giant coupled problem.
How to read the factorization: writing the joint density as $f_R$ times angle factors is the math way of saying radius quantization design and per-level angle codebooks can be reasoned about mostly separately, instead of one huge coupled quantizer in raw coordinates.

5) Angle density for $\ell \ge 2$ (concentration near $\pi/4$)
Intuition first: The long formula asks where angle $\psi$ is most likely. $\sin(2\psi)$ is maximized at $\psi=\pi/4$. A large power crushes values away from that peak → probability lives in a thin band near $45^\circ$ → few bits work well.
$$f_{\ell}\left(\psi_i^{(\ell)}\right) = \frac{\Gamma(2^{\ell-1})}{2^{2^{\ell-1}-2}\,\Gamma(2^{\ell-2})^2} \sin^{2^{\ell-1}-1}\!\left(2\psi_i^{(\ell)}\right)$$
One-sentence meaning: the fraction is a normalization constant (it makes the density integrate to 1 over $[0, \pi/2]$); the shape is dominated by a large power of $\sin(2\psi)$.
Intuition: $\sin(2\psi)$ is maximized at $\psi=\pi/4$. Raising it to a large power crushes mass away from the peak, so the distribution becomes sharply concentrated—great for low-bit codebooks.
Meaning of the huge exponent: it acts like a sharp filter—only angles extremely close to the maximizer keep meaningful probability. In quantization terms: typical draws live in a narrow band, so a small codebook covers most cases.
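This density is easy to sanity-check numerically. A sketch, with the normalization constant written in the form that makes $f_\ell$ integrate to one over $[0, \pi/2]$ (here $k = 2^{\ell-1}$ is the half-block dimension at level $\ell$):

```python
import math
import numpy as np

def angle_density(psi, lvl):
    """Level-`lvl` angle density (lvl >= 2) after Gaussian preconditioning.
    With k = 2**(lvl - 1) coordinates per half-block:
        f(psi) = 2**(2 - k) * Gamma(k) / Gamma(k/2)**2 * sin(2*psi)**(k - 1)"""
    k = 2 ** (lvl - 1)
    const = 2 ** (2 - k) * math.gamma(k) / math.gamma(k / 2) ** 2
    return const * np.sin(2 * psi) ** (k - 1)

psi = np.linspace(0.0, np.pi / 2, 100001)
dpsi = psi[1] - psi[0]
for lvl in (2, 3, 4):
    f = angle_density(psi, lvl)
    mass = ((f[:-1] + f[1:]) * dpsi / 2).sum()  # trapezoid rule: ~1
    peak = psi[np.argmax(f)]                    # mode: ~pi/4
    print(lvl, round(mass, 4), round(peak, 4))
```

As `lvl` grows, the exponent $k-1$ grows, and the printed curves pile their mass into an ever thinner band around $\pi/4$.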

6) Angle quantization objective (1D K-means style)
Intuition first: Place a few representative angles (codebook) on a line and round each true angle to the nearest one—classic 1D quantization. $b$ bits ⇒ $2^b$ bins.
$$\mathbb{E}_{\psi_i^{(\ell)} \sim f_{\ell}} \left[ \sum_{j \in [2^b] :\, \psi_i^{(\ell)} \in I_j^{(\ell)}} \left| \psi_i^{(\ell)} - \theta_j^{(\ell)} \right|^2 \right]$$
Symbols: $b$ bits ⇒ $2^b$ intervals; $I_j^{(\ell)}$ are the intervals; $\theta_j^{(\ell)}$ are the centroids.
What the objective means: pick bins ($I_j$) and representatives ($\theta_j$) on the line to minimize average squared error when $\psi$ is drawn from $f_\ell$—the same "nearest centroid" picture as 1D quantization / K-means on angles.
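A minimal 1D Lloyd/K-means sketch on simulated level-3 angles. The sample size, bit width, and the way angles are simulated (ratios of 4-dimensional Gaussian block norms) are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate level-3 angles: arctan of the ratio of two independent
# 4-dim Gaussian block norms (the post-preconditioning model).
g = rng.normal(size=(50000, 8))
psi = np.arctan2(np.linalg.norm(g[:, 4:], axis=1),
                 np.linalg.norm(g[:, :4], axis=1))

def kmeans_1d(samples, bits, iters=50):
    """Lloyd's algorithm in 1D: the centroids are the angle codebook."""
    centroids = np.quantile(samples, np.linspace(0.05, 0.95, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(2 ** bits):
            if np.any(idx == j):
                centroids[j] = samples[idx == j].mean()
    return centroids

codebook = kmeans_1d(psi, bits=3)                    # 2^3 = 8 angle bins
idx = np.abs(psi[:, None] - codebook[None, :]).argmin(axis=1)
mse = ((psi - codebook[idx]) ** 2).mean()
print(mse)   # small: the angles concentrate near pi/4, so 3 bits go far
```

The concentration does the heavy lifting: because nearly all samples sit in a narrow band around $\pi/4$, even a tiny codebook achieves a small expected squared error.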

7) Relative MSE guarantee (schematic)
Intuition first: Larger vectors may tolerate larger absolute error; the promise is relative—“error as a fraction of $\|x\|_2^2$.” Smaller $\varepsilon$ is better.
$$\mathbb{E}\left[\|x - x'\|_2^2\right] = \varepsilon \|x\|_2^2$$
Intuition: error scales with the squared norm of $x$; $\varepsilon$ controls relative accuracy.
How to read the equality: the left side is absolute MSE-style error; the right side rewrites it as (scale of $x$)$^2$ times a dimensionless $\varepsilon$. That is a scale-invariant way to promise quality: bigger vectors may tolerate larger absolute error while keeping the same relative quality.

8) Reconstruction from radius + angles
Intuition first: Index $i$ walks a binary tree; at each level you multiply by either $\cos\psi$ or $\sin\psi$ depending on left/right. The big formula is that rule in one line. Remember overall scale $\|x\|_2$ once × (cos or sin) per level.
$$x_i = \|x\|_2 \prod_{\ell=1}^{\log_2 d} \left(\cos \psi_{\lfloor i / 2^{\ell} \rfloor}^{(\ell)}\right)^{\mathbf{1}\{(i \bmod 2^{\ell}) \le 2^{\ell-1}\}} \left(\sin \psi_{\lfloor i / 2^{\ell} \rfloor}^{(\ell)}\right)^{\mathbf{1}\{(i \bmod 2^{\ell}) > 2^{\ell-1}\}}$$
Intuition: the indicators choose whether to multiply by $\cos$ or $\sin$ at each level when reconstructing coordinate $i$.
Reconstruction meaning: you set overall scale with $\|x\|_2$, then walk down the same binary tree used to define angles—each step chooses a left/right branch, implemented as multiplying by either $\cos\psi$ or $\sin\psi$ for the relevant subtree angle.
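The encode/reconstruct loop fits in a few lines. A sketch of the recursive transform (quantization omitted, so the round trip is exact up to float error; `arctan2` stands in for $\tan^{-1}$ so signed level-1 coordinates survive):

```python
import numpy as np

def polar_encode(x):
    """Vector of length 2^L -> (radius, list of per-level angle arrays)."""
    angles = []
    vals = np.asarray(x, dtype=float)
    while len(vals) > 1:
        pairs = vals.reshape(-1, 2)
        # level 1: angle of each coordinate pair (arctan2 keeps signs);
        # higher levels: angle between non-negative left/right block norms
        angles.append(np.arctan2(pairs[:, 1], pairs[:, 0]))
        vals = np.linalg.norm(pairs, axis=1)   # collapse each pair to a norm
    return vals[0], angles                     # vals[0] == ||x||_2

def polar_decode(radius, angles):
    """Walk back down the binary tree: each angle splits one scale into
    (scale*cos(psi), scale*sin(psi)) for the left/right children."""
    scales = np.array([radius])
    for psi in reversed(angles):
        scales = np.stack([scales * np.cos(psi),
                           scales * np.sin(psi)], axis=1).ravel()
    return scales

x = np.array([3.0, 4.0, 4.0, 3.0])
r, angles = polar_encode(x)
print(r, angles[-1])            # radius sqrt(50); top-level angle pi/4
print(polar_decode(r, angles))  # recovers the original vector
```

In the real method the angle arrays would be replaced by codebook indices before decoding; the tree walk itself is unchanged.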

9) Attention unchanged
Intuition first: $\hat{K},\hat{V}$ are decompressed caches. Use the same softmax and $\sqrt{d}$ scaling as training—only the tensor contents changed.
$$\operatorname{softmax}\!\left(\frac{\hat{K}_{:i} \cdot q_i}{\sqrt{d}}\right) \hat{V}_{:i}$$
PolarQuant only changes how KV is stored/dequantized; the attention math stays the same.
What stays the same: the learned attention mechanism; what changes: the tensors feeding it are decompressed KV caches ($\hat{K},\hat{V}$) rather than raw FP16 caches.
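To make "unchanged" concrete: vanilla single-query attention over stand-in dequantized caches (a NumPy sketch; shapes and values are illustrative, not the paper's kernels):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

d, n = 8, 5                    # head dim, number of cached tokens
rng = np.random.default_rng(0)
K_hat = rng.normal(size=(n, d))   # dequantized Key cache (stand-in values)
V_hat = rng.normal(size=(n, d))   # dequantized Value cache (stand-in values)
q = rng.normal(size=d)            # current query

w = softmax(K_hat @ q / np.sqrt(d))  # same softmax(qK^T / sqrt(d)) as always
out = w @ V_hat                      # attention output; formula untouched
print(w.sum(), out.shape)            # weights sum to 1; output has head dim
```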
[Toy Data Walkthrough] (integer-friendly)
A tiny worked example to trace level 1 → level 2 → $\pi/4$ with real numbers—more important than memorizing symbols.
Let $x=(3,4,4,3)$. For a pencil-and-paper example, take integer coordinates after preconditioning, $x'=(3,4,4,3)$. (In practice $Sx$ is usually non-integer.)
Why $x$ and $x'$ match here: in a real run, $x' = Sx$ changes values. We pick the same integers to make norms and ratios clean by hand—the algorithmic steps are unchanged; only the worked numbers are simplified.
Level 1
$$\psi_1^{(1)} = \tan^{-1}(4/3),\quad \psi_2^{(1)} = \tan^{-1}(3/4)$$
So $\psi_1^{(1)} \approx 0.93$ rad ($\approx 53^\circ$) and $\psi_2^{(1)} \approx 0.64$ rad ($\approx 37^\circ$)—both away from $0$ and $\pi/2$.
What the two angles represent: the first is the direction of $(3,4)$ in its 2D plane; the second is the direction of $(4,3)$. Each is local pair-level direction, before comparing bigger half-vector energies.
Level 2 (block norms)
$$\|x'_{1:2}\|_2 = \sqrt{3^2+4^2}=5,\quad \|x'_{3:4}\|_2 = \sqrt{4^2+3^2}=5$$
The ratio is exactly $1$, hence
$$\psi_1^{(2)} = \tan^{-1}(1) = \frac{\pi}{4}.$$
The 3–4–5 triple makes both half-blocks the same length, so the toy case lands exactly on $\pi/4$, not just "near" it.
Meaning of ratio $=1$: equal norms mean equal energy in the first two coordinates vs the last two; $\tan^{-1}(1)=\pi/4$ is the unbiased split angle between the two halves.
Quantization + attention: quantize angles; plug dequantized $\hat{K},\hat{V}$ into standard attention.
[Experiments & Engineering] From math to PyTorch, CUDA, and benchmarks
Beautiful formulas naturally raise a practical question: how was this actually implemented, and what did the numbers look like? The PolarQuant authors did not stop at theory—they evaluated the method on real models such as Llama-3.1 under demanding benchmarks. Below: implementation details and four headline experiments.

1. Hardware & implementation
How equations become runnable code matters for reproducibility.
- Hardware: experiments ran on a single NVIDIA RTX A6000 with 48GB VRAM.
- Framework & dtypes: implemented in PyTorch; quantized angle indices were packed into 8-bit storage (torch.uint8).
- Custom CUDA kernels: dedicated kernels accelerate key products such as query × dequantized Key cache and attention scores × dequantized Value cache.
- Shared preconditioner $S$: the random sketch / rotation matrix $S$ used for preconditioning was shared across Keys, Values, all layers, and all attention heads—one global mixer, reused everywhere.
- Two codebook strategies: angle quantization used 1D k-means++ clustering.
- Online: recompute angles and re-cluster per prompt (like mixing a fresh seasoning for every order).
- Offline: reuse a precomputed angle codebook for every prompt (like a pre-batched master seasoning).

2. Experiment 1 — Does the angle distribution look “right” on real KV caches?
This checks whether the theoretical angle story matches real KV activations.
- Setup: sample a prompt from Qasper (LongBench), extract KV cache, apply a polar transform with $L=4$ levels.
- Result: without preconditioning, angles look messy; with preconditioning, outliers shrink. Higher levels show sharp concentration near $\pi/4$ on real data—not just on paper.

3. Experiment 2 — Needle-In-A-Haystack under extreme context
A classic stress test: after aggressive compression, can the model still retrieve a hidden fact?
- Setup: Llama-3.1-8B-Instruct, context length from 4K up to 104K.
- Fair memory budget: compared against SnapKV, PyramidKV, KIVI, etc., with all methods limited to 25% of the original KV memory (compression ratio 0.25).
- Result: token-dropping methods can miss information; PolarQuant beats KIVI and lands closest to perfect retrieval among the reported baselines.

4. Experiment 3 — LongBench (broad long-context tasks)
Beyond a single needle task: summarization, QA, code, and more.
- Setup: LongBench-V1. Note: caches for newly generated tokens were kept in FP16 for accuracy, as stated in the paper.
- Result: uncompressed (Exact) average 48.63 vs PolarQuant-R (online) 48.37—nearly identical. Reported scores also place it above KIVI (46.70) and SnapKV (44.57).

5. Experiment 4 — Runtime
Quality without speed is hard to deploy.
- Setup: measure time for 16,384 input tokens and 1,024 generated tokens.
- Decode: PolarQuant is about 14% faster than KIVI in the generation phase (as reported).
- Prefill: online codebooks cost about 11.633s prefill time vs 3.364s for offline codebooks—clustering dominates the gap. Because quality is similar, the authors strongly recommend offline codebooks when speed matters.

Bottom line: PolarQuant is validated end-to-end—PyTorch + CUDA + real Llama-class models, tying together distribution checks, extreme-context retrieval, general long-bench scores, and runtime.
Intuition: these numbers reflect careful compression rather than dropping tokens to save memory. Offline codebooks are like a prebuilt angle palette—much faster prefill than online re-clustering per prompt, which matters in production.
[Conclusion & Limitations]
Practical significance
1. PolarQuant shows that quantization does not have to carry normalization metadata forever.
2. It directly targets the memory hotspot of long-context serving.
3. It changes the cache representation without requiring a new attention mechanism.
Limitations
- Codebook construction still leaves room for better analytic designs.
- The paper is strongest on KV-cache quantization; extending the idea to weights or activations needs more evidence.
- Real deployment still depends on efficient kernels, packing layouts, and careful implementation.
[Visualization Plan]
The left panel should depict traditional block quantization: many blocks, each carrying extra helper numbers to reconstruct the stored values. The right panel should depict PolarQuant: random preconditioning, polar conversion, one radius, and highly concentrated angles near $45^\circ$.

KV storage at a glance

Legacy stacks FP16 metadata per block; PolarQuant keeps r and angles.

Block quant

Each block still needs extra numbers to decode the short codes, so overhead remains even when values look compressed.
[Diagram labels: KV → INT4 + FP16 meta, × N blocks — metadata overhead ↑]

PolarQuant

After random preconditioning, the method moves to polar coordinates and quantizes concentrated angles instead of storing normalization metadata.
[Diagram labels: KV → S → r + θ codebook — footprint ↓]

How to read the diagram labels

- FP16: Half-precision floating point (16 bits per number). About half the footprint of FP32 for the same count of values; slightly coarser grid of representable numbers.
- Quantization: Rounding real values onto a small set of integer codes to save bits. At use time you dequantize; you often need per-block helper numbers to map codes back to the right range.
- KV: A chunk of cached Key/Value vectors for past tokens (attention memory).
- INT4: Values packed into 4-bit integers—small, but not usable without extra info.
- +meta / FP16: High-precision helper numbers (scale, zero-point, etc.) needed to dequantize; stored separately.
- × N: Roughly: that metadata repeats for every block, so cost grows with N.
- S: Random matrix that mixes coordinates (preconditioning) before the polar transform.
- r: Radius: overall magnitude in polar form.
- θ: Angle (direction). Often stored as a codebook index instead of a full float.
- codebook: A small table of typical angles—like a palette—so you only store an index.
PolarQuant is elegant because it changes the coordinate system of the problem. Instead of forcing raw coordinates into low bits and paying normalization overhead, it stores one radius and a set of structured angles. That makes it especially attractive when KV-cache memory, not model size, is the true serving bottleneck.