Chapter 1: PolarQuant: Quantizing KV Caches with Polar Transformation

In long-context LLM serving, the bottleneck is often not the model weights but the KV cache memory. PolarQuant attacks that bottleneck directly: after random preconditioning, it rewrites a KV vector in polar form and stores angles compactly, cutting the usual burden of extra “how to reconstruct the numbers” side information. This review unpacks the main formulas, why the angle distribution concentrates near $\pi/4$, and what that means for real systems.
[Abstract & Introduction] 3-line summary + problem framing
3-line summary
- Long-context LLMs keep past Key and Value embeddings in memory, so the KV cache becomes the real VRAM bottleneck.
- Older approach: even if you store numbers with very few bits, each chunk usually needs extra helper numbers that say how to map those short codes back to the original range. Those helpers are often stored in a more precise format (e.g. FP16), so VRAM savings don’t feel as big as you’d hope.
- PolarQuant applies random preconditioning, moves to polar coordinates, and stores angles compactly. With less reliance on that heavy side bookkeeping, it still reaches >4.2× KV-cache compression with strong long-context quality.
Intuition by analogy: boxes with labels vs a compass warehouse
Conventional quantization is like shrinking the boxes in a warehouse but still attaching a heavy label to each box that records its range. PolarQuant instead mixes the content first, keeps one radius value, and then records mostly directional information. The storage becomes lighter because the expensive labels disappear.
In plain words
- The dilemma: shrinking numbers is not enough—each block still needs a heavy FP16 “manual” (scale, zero-point, …). Packaging can outweigh the payload.
- What PolarQuant stores: swap axis values for one radius $r$ (overall size) plus angles (direction)—“which way does the mass tilt?”.
- Mixer $S$ and $45^\circ$: random mixing before polar encoding often balances left vs right halves; angles then cluster near $\pi/4$. A narrow, predictable band means few bits can quantize angles.
- Reported wins: >4.2× KV-cache compression, strong ~104K needle tests, and much faster prefill with offline codebooks (~3.4s vs ~11.6s in one paper setting).
One line: exploit predictable angle concentration to drop most of that extra reconstruction overhead.
[Background] Essential prerequisites
Read this first — before the bullet list
- Floating-point (FP32 vs FP16) Neural nets usually store activations as floating-point numbers. FP32 is 32-bit single precision; FP16 is half precision—about half the bits per number, so you can fit roughly twice as many values in the same memory, with a slightly coarser grid of representable values. When papers say the KV cache is kept in FP16, they mean the uncompressed baseline uses 16-bit floats.
- What quantization means You round many possible real values onto a small set of integer codes (e.g. 4 bits ⇒ $2^4 = 16$ levels). The usual goals are less memory and sometimes faster math; at inference time you dequantize back to something close to the original. You often need extra helper numbers (scale, zero-point, …) so each short code can be mapped back to the right numeric range.
- Why INT4 and FP16 appear in the same sentence The payload weights may be INT4, but per-block metadata that explains how to decode those codes is often stored in a wider, higher-precision format (e.g. FP16). That bookkeeping can eat a lot of the apparent bit savings—the sections below show how PolarQuant tries to shrink that layer.
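That amortized metadata cost is simple arithmetic. A minimal sketch (the block size of 32 and the two FP16 helpers per block are illustrative assumptions for the general block-quantization pattern, not values from the paper):

```python
def effective_bits(payload_bits=4, block=32, meta_numbers=2, meta_bits=16):
    """Effective bits per stored value under block quantization:
    the payload bits plus per-block metadata amortized over the block."""
    return payload_bits + meta_numbers * meta_bits / block

# INT4 payload + one FP16 scale and one FP16 zero-point per 32-value block:
print(effective_bits())            # 5.0 bits/value, i.e. 25% over pure INT4
print(effective_bits(block=128))   # 4.25: larger blocks amortize better
```

Shrinking that metadata layer, which is what PolarQuant targets, matters most at small block sizes, where the helpers dominate the payload.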

Each item below pairs a definition with what it means for PolarQuant (not just vocabulary).
- KV cache stores past Keys and Values so autoregressive decoding can avoid recomputation. As sequence length grows, this cache grows almost linearly and often dominates inference memory.
Meaning here: the vector $x$ in later formulas is best read as a cached activation, not a model weight. PolarQuant changes how that cache is represented, not the attention definition.
- Quantization overhead To squeeze values into very short integers, you typically need per-chunk helper numbers that set the mapping back to the real range (in papers/code these are often called scale, zero-point, etc.). Those helpers are often stored per block and in high precision, so they add hidden cost.
Meaning here: part of why “bits went down but VRAM didn’t” is those helper numbers tagging along. PolarQuant tries to shrink that layer by changing coordinates.
- Polar coordinates replace raw coordinates with one radius and several angles. PolarQuant extends this intuition to high dimensions through a recursive construction.
Meaning here: you stop arguing in raw axis units and instead store one overall scale (radius) + directional information (angles), which is what enables storing angles compactly without the old per-block scaling pattern.
- Random preconditioning mixes the coordinates so the transformed vector behaves like a Gaussian object instead of being aligned with a few unstable axes.
Meaning here: $S$ is not cosmetic—it is what makes the subsequent Gaussian-style analysis of angles legitimate enough to design codebooks.
- Concentration in high dimensions is the magic ingredient: after preconditioning, certain angles concentrate sharply, which makes them cheap to quantize.
Meaning here: "where the angle mass sits" directly becomes how many bits you need—tight mass means simpler codebooks.
Everyday hook: the KV cache is the warehouse for past tokens; quantization shrinks the numbers in each bin. But if every bin also needs a sticky note explaining how to decode it, the notes can outweigh the savings.
[Method] Core proposal — formula walkthrough (detailed)
PolarQuant asks: *after random mixing, what is the cheapest way to store a KV vector?*
How to read this section (roadmap)
You do not need to memorize every symbol. Keep the order of ideas and skip dense parts if needed.
1. Mix with $S$ — spread mass away from a few axes so later per-angle quantization is easier to justify.
2. Build angles in layers — level 1 is “direction of each neighbor pair”; higher levels compare larger left/right chunks with one angle each.
3. Joint density factorizes — length (radius) and angle blocks behave almost independently, so design splits.
4. Angles pile up near $\pi/4$ — when typical angles sit in a narrow band, few bits suffice.
5. Store angles compactly → reconstruct coordinates → feed standard attention — the attention formula itself is unchanged.
Light reading: for (4)–(5) you can ignore exponents and $\Gamma(\cdot)$; it is enough to remember “probability mass concentrates near $\pi/4$.”

1) Random preconditioning: why $Sx$ looks Gaussian
Intuition first: $Sx \sim \mathcal{N}(\cdots)$ is a modeling picture: coordinates of $Sx$ jiggle with similar strength and can be analyzed mostly separately. Real samples need not be perfectly Gaussian—this is the approximation that supports per-angle quantization.
Start from $x \in \mathbb{R}^{d}$ and multiply by a random sketch matrix $S \in \mathbb{R}^{m \times d}$:
$$Sx \sim \mathcal{N}(0, \|x\|_2^2 I_m)$$
One-sentence meaning: each coordinate of $Sx$ is an independent Gaussian with mean 0 and variance $\|x\|_2^2$.
Symbol breakdown
- $x$: original KV vector to quantize (one head, one token)
- $d$: original dimension
- $S$: random preconditioning matrix (drawn under the paper's assumptions)
- $m$: output dimension after sketching
- $I_m$: the $m \times m$ identity matrix
- $\|x\|_2$: Euclidean norm of $x$
- $\mathcal{N}(0, \sigma^2 I_m)$: multivariate normal with zero mean and isotropic covariance $\sigma^2 I_m$
What this distribution statement is *for*: it is not a claim that $Sx$ is "exactly always Gaussian in every finite setting", but a working model that coordinates are similarly scaled and roughly independent. That is what later allows per-angle quantization without modeling a huge cross-angle coupling term.
Step-by-step intuition
1. Variance scales with $\|x\|_2^2$: larger original vectors produce larger coordinate fluctuations after mixing.
2. Independence across coordinates: $I_m$ means coordinates are not coupled in the covariance—this is what makes later per-angle quantization cleaner.
3. Why this helps: mixing reduces axis-aligned outliers; storing radius + angles becomes more natural than per-coordinate scaling metadata.
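A quick simulation of the statement above, under the common assumption that $S$ has i.i.d. standard normal entries (an illustrative sketch, not necessarily the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 64, 8, 4000
x = rng.uniform(-1.0, 1.0, size=d)     # a fixed, arbitrary "KV-like" vector

# With S_ij iid N(0,1), each coordinate of Sx is a weighted sum of Gaussians,
# hence exactly N(0, ||x||_2^2), and coordinates are uncorrelated.
S = rng.normal(size=(trials, m, d))
Sx = S @ x                              # trials x m samples of the sketch

print(Sx.mean())                        # ~ 0
print(Sx.var(), np.linalg.norm(x)**2)   # empirical variance ~ ||x||_2^2
```

The empirical variance tracks $\|x\|_2^2$ regardless of how lopsided the original coordinates were, which is the whole point of the mixing step.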

2) Level-1 angles from adjacent pairs
Intuition first: Same spirit as the slope angle of $(a,b)$ via $\tan^{-1}(b/a)$. Length is not removed yet—only direction. Pairing coordinates yields $d/2$ small angles.
$$\psi_j^{(1)} := \tan^{-1}\!\left(\frac{x_{2j}}{x_{2j-1}}\right), \quad j = 1, \ldots, d/2$$
One-sentence meaning: for each adjacent pair $(x_{2j-1}, x_{2j})$, record the angle of the 2D vector in the plane.
Symbol breakdown: $\psi_j^{(1)}$ is the local angle at the bottom of the recursion; $j$ indexes pairs.
What $\psi_j^{(1)}$ is really capturing: it compresses a 2D pair into a direction first, before larger-scale "which half is bigger?" comparisons show up at higher levels.
Intuition: level 1 captures fine, local direction between neighboring coordinates.
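In code, level 1 is a single vectorized arctangent over adjacent pairs. A sketch on a small vector (the 3-4-5 values are chosen for clean numbers; `arctan2` stands in for $\tan^{-1}$ so negative coordinates keep their quadrant):

```python
import numpy as np

x = np.array([3.0, 4.0, 4.0, 3.0])            # toy vector, d = 4
pairs = x.reshape(-1, 2)                       # (x1, x2) and (x3, x4)
psi_1 = np.arctan2(pairs[:, 1], pairs[:, 0])   # d/2 level-1 angles
print(psi_1)   # ~ [0.9273, 0.6435] rad, i.e. about 53 and 37 degrees
```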

3) Level $\ell \ge 2$: angle from left/right block norms
Intuition first: Heavy index notation, but one idea: split a segment into front half vs back half, form (norm of back)/(norm of front), then apply $\tan^{-1}$. Ratio $\approx 1$ means balanced energy → angle near $\pi/4$. Higher $\ell$ compares larger chunks.
$$\psi_j^{(\ell)} := \tan^{-1}\!\left( \frac{\left\|x_{(j-1/2)2^{\ell}+1:\,j2^{\ell}}\right\|_2}{\left\|x_{(j-1)2^{\ell}+1:\,(j-1/2)2^{\ell}}\right\|_2} \right)$$
One-sentence meaning: compare the energy of the right half block vs the left half block inside a segment; take $\tan^{-1}$ of their ratio.
Key intuition: if the two halves have similar energy, the ratio is ~1 and the angle lands near $\pi/4$.
What the ratio is really measuring: numerator/denominator compares how much vector energy sits in the right chunk vs the left chunk. Near 1 means balanced halves, which maps to an angle near $\pi/4$ under $\tan^{-1}$.
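The same idea in code for a single level-2 angle (a sketch on the toy 3-4-5 values; balanced halves land exactly on $\pi/4$):

```python
import numpy as np

x = np.array([3.0, 4.0, 4.0, 3.0])
left, right = x[:2], x[2:]                            # front half vs back half
ratio = np.linalg.norm(right) / np.linalg.norm(left)  # 5 / 5 = 1
psi_2 = np.arctan(ratio)
print(psi_2, np.pi / 4)   # equal energy in both halves -> angle is pi/4
```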

4) Joint density factorization
Intuition first: “Joint density = product” ≈ independence: you can think about how to quantize radius and how to quantize each angle level mostly separately, instead of one giant coupled quantizer.
$$f_{R,\Psi_d}(r, \psi_d(x)) = f_R(r) \prod_{\ell=1}^{\log_2 d} f_{\Psi^{(\ell)}}\!\left(\psi^{(\ell)}\right)$$
One-sentence meaning: the radius and the angle blocks are independent in the joint distribution—multiply the marginals.
Why it matters: you can quantize angles independently in 1D instead of solving one giant coupled problem.
How to read the factorization: writing the joint density as $f_R$ times angle factors is the math way of saying radius quantization design and per-level angle codebooks can be reasoned about mostly separately, instead of one huge coupled quantizer in raw coordinates.

5) Angle density for $\ell \ge 2$ (concentration near $\pi/4$)
Intuition first: The long formula asks where angle $\psi$ is most likely. $\sin(2\psi)$ is maximized at $\psi=\pi/4$. A large power crushes values away from that peak → probability lives in a thin band near $45^\circ$ → few bits work well.
$$f_{\ell}\left(\psi_i^{(\ell)}\right) = \frac{\Gamma(2^{\ell-1})}{2^{2^{\ell-1}-2}\,\Gamma(2^{\ell-2})^2} \sin^{2^{\ell-1}-1}\!\left(2\psi_i^{(\ell)}\right)$$
One-sentence meaning: the fraction is a normalization constant (it makes the density integrate to 1 over $[0, \pi/2]$); the shape is dominated by a large power of $\sin(2\psi)$.
Intuition: $\sin(2\psi)$ is maximized at $\psi=\pi/4$. Raising it to a large power crushes mass away from the peak, so the distribution becomes sharply concentrated—great for low-bit codebooks.
Meaning of the huge exponent: it acts like a sharp filter—only angles extremely close to the maximizer keep meaningful probability. In quantization terms: typical draws live in a narrow band, so a small codebook covers most cases.
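This density is easy to sanity-check numerically. A sketch, with the normalization constant written in the form that makes $f_\ell$ integrate to one over $[0, \pi/2]$ (here $k = 2^{\ell-1}$ is the half-block dimension at level $\ell$):

```python
import math
import numpy as np

def angle_density(psi, lvl):
    """Level-`lvl` angle density (lvl >= 2) after Gaussian preconditioning.
    With k = 2**(lvl - 1) coordinates per half-block:
        f(psi) = 2**(2 - k) * Gamma(k) / Gamma(k/2)**2 * sin(2*psi)**(k - 1)"""
    k = 2 ** (lvl - 1)
    const = 2 ** (2 - k) * math.gamma(k) / math.gamma(k / 2) ** 2
    return const * np.sin(2 * psi) ** (k - 1)

psi = np.linspace(0.0, np.pi / 2, 100001)
dpsi = psi[1] - psi[0]
for lvl in (2, 3, 4):
    f = angle_density(psi, lvl)
    mass = ((f[:-1] + f[1:]) * dpsi / 2).sum()  # trapezoid rule: ~1
    peak = psi[np.argmax(f)]                    # mode: ~pi/4
    print(lvl, round(mass, 4), round(peak, 4))
```

As `lvl` grows, the exponent $k-1$ grows, and the printed curves pile their mass into an ever thinner band around $\pi/4$.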

6) Angle quantization objective (1D K-means style)
Intuition first: Place a few representative angles (codebook) on a line and round each true angle to the nearest one—classic 1D quantization. $b$ bits ⇒ $2^b$ bins.
$$\mathbb{E}_{\psi_i^{(\ell)} \sim f_{\ell}} \left[ \sum_{j \in [2^b] :\, \psi_i^{(\ell)} \in I_j^{(\ell)}} \left| \psi_i^{(\ell)} - \theta_j^{(\ell)} \right|^2 \right]$$
Symbols: $b$ bits ⇒ $2^b$ intervals; $I_j^{(\ell)}$ are the intervals; $\theta_j^{(\ell)}$ are the centroids.
What the objective means: pick bins ($I_j$) and representatives ($\theta_j$) on the line to minimize average squared error when $\psi$ is drawn from $f_\ell$—the same "nearest centroid" picture as 1D quantization / K-means on angles.
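A minimal 1D Lloyd/K-means sketch on simulated level-3 angles. The sample size, bit width, and the way angles are simulated (ratios of 4-dimensional Gaussian block norms) are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate level-3 angles: arctan of the ratio of two independent
# 4-dim Gaussian block norms (the post-preconditioning model).
g = rng.normal(size=(50000, 8))
psi = np.arctan2(np.linalg.norm(g[:, 4:], axis=1),
                 np.linalg.norm(g[:, :4], axis=1))

def kmeans_1d(samples, bits, iters=50):
    """Lloyd's algorithm in 1D: the centroids are the angle codebook."""
    centroids = np.quantile(samples, np.linspace(0.05, 0.95, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(2 ** bits):
            if np.any(idx == j):
                centroids[j] = samples[idx == j].mean()
    return centroids

codebook = kmeans_1d(psi, bits=3)                    # 2^3 = 8 angle bins
idx = np.abs(psi[:, None] - codebook[None, :]).argmin(axis=1)
mse = ((psi - codebook[idx]) ** 2).mean()
print(mse)   # small: the angles concentrate near pi/4, so 3 bits go far
```

The concentration does the heavy lifting: because nearly all samples sit in a narrow band around $\pi/4$, even a tiny codebook achieves a small expected squared error.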

7) Relative MSE guarantee (schematic)
Intuition first: Larger vectors may tolerate larger absolute error; the promise is relative—“error as a fraction of $\|x\|_2^2$.” Smaller $\varepsilon$ is better.
$$\mathbb{E}\left[\|x - x'\|_2^2\right] = \varepsilon \|x\|_2^2$$
Intuition: error scales with the squared norm of $x$; $\varepsilon$ controls relative accuracy.
How to read the equality: the left side is absolute MSE-style error; the right side rewrites it as (scale of $x$)$^2$ times a dimensionless $\varepsilon$. That is a scale-invariant way to promise quality: bigger vectors may tolerate larger absolute error while keeping the same relative quality.

8) Reconstruction from radius + angles
Intuition first: Index $i$ walks a binary tree; at each level you multiply by either $\cos\psi$ or $\sin\psi$ depending on left/right. The big formula is that rule in one line. Remember overall scale $\|x\|_2$ once × (cos or sin) per level.
$$x_i = \|x\|_2 \prod_{\ell=1}^{\log_2 d} \left(\cos \psi_{\lfloor i / 2^{\ell} \rfloor}^{(\ell)}\right)^{\mathbf{1}\{(i \bmod 2^{\ell}) \le 2^{\ell-1}\}} \left(\sin \psi_{\lfloor i / 2^{\ell} \rfloor}^{(\ell)}\right)^{\mathbf{1}\{(i \bmod 2^{\ell}) > 2^{\ell-1}\}}$$
Intuition: the indicators choose whether to multiply by $\cos$ or $\sin$ at each level when reconstructing coordinate $i$.
Reconstruction meaning: you set overall scale with $\|x\|_2$, then walk down the same binary tree used to define angles—each step chooses a left/right branch, implemented as multiplying by either $\cos\psi$ or $\sin\psi$ for the relevant subtree angle.
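The encode/reconstruct loop fits in a few lines. A sketch of the recursive transform (quantization omitted, so the round trip is exact up to float error; `arctan2` stands in for $\tan^{-1}$ so signed level-1 coordinates survive):

```python
import numpy as np

def polar_encode(x):
    """Vector of length 2^L -> (radius, list of per-level angle arrays)."""
    angles = []
    vals = np.asarray(x, dtype=float)
    while len(vals) > 1:
        pairs = vals.reshape(-1, 2)
        # level 1: angle of each coordinate pair (arctan2 keeps signs);
        # higher levels: angle between non-negative left/right block norms
        angles.append(np.arctan2(pairs[:, 1], pairs[:, 0]))
        vals = np.linalg.norm(pairs, axis=1)   # collapse each pair to a norm
    return vals[0], angles                     # vals[0] == ||x||_2

def polar_decode(radius, angles):
    """Walk back down the binary tree: each angle splits one scale into
    (scale*cos(psi), scale*sin(psi)) for the left/right children."""
    scales = np.array([radius])
    for psi in reversed(angles):
        scales = np.stack([scales * np.cos(psi),
                           scales * np.sin(psi)], axis=1).ravel()
    return scales

x = np.array([3.0, 4.0, 4.0, 3.0])
r, angles = polar_encode(x)
print(r, angles[-1])            # radius sqrt(50); top-level angle pi/4
print(polar_decode(r, angles))  # recovers the original vector
```

In the real method the angle arrays would be replaced by codebook indices before decoding; the tree walk itself is unchanged.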

9) Attention unchanged
Intuition first: $\hat{K},\hat{V}$ are decompressed caches. Use the same softmax and $\sqrt{d}$ scaling as training—only the tensor contents changed.
$$\operatorname{softmax}\!\left(\frac{\hat{K}_{:i} \cdot q_i}{\sqrt{d}}\right) \hat{V}_{:i}$$
PolarQuant only changes how KV is stored/dequantized; the attention math stays the same.
What stays the same: the learned attention mechanism; what changes: the tensors feeding it are decompressed KV caches ($\hat{K},\hat{V}$) rather than raw FP16 caches.
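To make "unchanged" concrete: vanilla single-query attention over stand-in dequantized caches (a NumPy sketch; shapes and values are illustrative, not the paper's kernels):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

d, n = 8, 5                    # head dim, number of cached tokens
rng = np.random.default_rng(0)
K_hat = rng.normal(size=(n, d))   # dequantized Key cache (stand-in values)
V_hat = rng.normal(size=(n, d))   # dequantized Value cache (stand-in values)
q = rng.normal(size=d)            # current query

w = softmax(K_hat @ q / np.sqrt(d))  # same softmax(qK^T / sqrt(d)) as always
out = w @ V_hat                      # attention output; formula untouched
print(w.sum(), out.shape)            # weights sum to 1; output has head dim
```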
[Toy Data Walkthrough] (integer-friendly)
A tiny worked example to trace level 1 → level 2 → $\pi/4$ with real numbers—more important than memorizing symbols.
Let $x=(3,4,4,3)$. For a pencil-and-paper example, take integer coordinates after preconditioning, $x'=(3,4,4,3)$. (In practice $Sx$ is usually non-integer.)
Why $x$ and $x'$ match here: in a real run, $x' = Sx$ changes values. We pick the same integers to make norms and ratios clean by hand—the algorithmic steps are unchanged; only the worked numbers are simplified.
Level 1
$$\psi_1^{(1)} = \tan^{-1}(4/3),\quad \psi_2^{(1)} = \tan^{-1}(3/4)$$
So $\psi_1^{(1)} \approx 0.93$ rad ($\approx 53^\circ$) and $\psi_2^{(1)} \approx 0.64$ rad ($\approx 37^\circ$)—both away from $0$ and $\pi/2$.
What the two angles represent: the first is the direction of $(3,4)$ in its 2D plane; the second is the direction of $(4,3)$. Each is local pair-level direction, before comparing bigger half-vector energies.
Level 2 (block norms)
$$\|x'_{1:2}\|_2 = \sqrt{3^2+4^2}=5,\quad \|x'_{3:4}\|_2 = \sqrt{4^2+3^2}=5$$
The ratio is exactly $1$, hence
$$\psi_1^{(2)} = \tan^{-1}(1) = \frac{\pi}{4}.$$
The 3–4–5 triple makes both half-blocks the same length, so the toy case lands exactly on $\pi/4$, not just "near" it.
Meaning of ratio $=1$: equal norms mean equal energy in the first two coordinates vs the last two; $\tan^{-1}(1)=\pi/4$ is the unbiased split angle between the two halves.
Quantization + attention: quantize angles; plug dequantized $\hat{K},\hat{V}$ into standard attention.
[Experiments & Engineering] From math to PyTorch, CUDA, and benchmarks
Beautiful formulas naturally raise a practical question: how was this actually implemented, and what did the numbers look like? The PolarQuant authors did not stop at theory—they evaluated the method on real models such as Llama-3.1 under demanding benchmarks. Below: implementation details and four headline experiments.

1. Hardware & implementation
How equations become runnable code matters for reproducibility.
- Hardware: experiments ran on a single NVIDIA RTX A6000 with 48GB VRAM.
- Framework & dtypes: implemented in PyTorch; quantized angle indices were packed into 8-bit storage (torch.uint8).
- Custom CUDA kernels: dedicated kernels accelerate key products such as query × dequantized Key cache and attention scores × dequantized Value cache.
- Shared preconditioner $S$: the random sketch / rotation matrix $S$ used for preconditioning was shared across Keys, Values, all layers, and all attention heads—one global mixer, reused everywhere.
- Two codebook strategies: angle quantization used 1D k-means++ clustering.
- Online: recompute angles and re-cluster per prompt (like mixing a fresh seasoning for every order).
- Offline: reuse a precomputed angle codebook for every prompt (like a pre-batched master seasoning).

2. Experiment 1 — Does the angle distribution look “right” on real KV caches?
This checks whether the theoretical angle story matches real KV activations.
- Setup: sample a prompt from Qasper (LongBench), extract KV cache, apply a polar transform with $L=4$ levels.
- Result: without preconditioning, angles look messy; with preconditioning, outliers shrink. Higher levels show sharp concentration near $\pi/4$ on real data—not just on paper.

3. Experiment 2 — Needle-In-A-Haystack under extreme context
A classic stress test: after aggressive compression, can the model still retrieve a hidden fact?
- Setup: Llama-3.1-8B-Instruct, context length from 4K up to 104K.
- Fair memory budget: compared against SnapKV, PyramidKV, KIVI, etc., with all methods limited to 25% of the original KV memory (compression ratio 0.25).
- Result: token-dropping methods can miss information; PolarQuant beats KIVI and lands closest to perfect retrieval among the reported baselines.

4. Experiment 3 — LongBench (broad long-context tasks)
Beyond a single needle task: summarization, QA, code, and more.
- Setup: LongBench-V1. Note: caches for newly generated tokens were kept in FP16 for accuracy, as stated in the paper.
- Result: uncompressed (Exact) average 48.63 vs PolarQuant-R (online) 48.37—nearly identical. Reported scores also place it above KIVI (46.70) and SnapKV (44.57).

5. Experiment 4 — Runtime
Quality without speed is hard to deploy.
- Setup: measure time for 16,384 input tokens and 1,024 generated tokens.
- Decode: PolarQuant is about 14% faster than KIVI in the generation phase (as reported).
- Prefill: online codebooks cost about 11.633s prefill time vs 3.364s for offline codebooks—clustering dominates the gap. Because quality is similar, the authors strongly recommend offline codebooks when speed matters.

Bottom line: PolarQuant is validated end-to-end—PyTorch + CUDA + real Llama-class models, tying together distribution checks, extreme-context retrieval, general long-bench scores, and runtime.
Intuition: these numbers reflect careful compression rather than dropping tokens to save memory. Offline codebooks are like a prebuilt angle palette—much faster prefill than online re-clustering per prompt, which matters in production.
[Conclusion & Limitations]
Practical significance
1. PolarQuant shows that quantization does not have to carry normalization metadata forever.
2. It directly targets the memory hotspot of long-context serving.
3. It changes the cache representation without requiring a new attention mechanism.
Limitations
- Codebook construction still leaves room for better analytic designs.
- The paper is strongest on KV-cache quantization; extending the idea to weights or activations needs more evidence.
- Real deployment still depends on efficient kernels, packing layouts, and careful implementation.
[Visualization Plan]
The left panel should depict traditional block quantization: many blocks, each carrying extra helper numbers to reconstruct the stored values. The right panel should depict PolarQuant: random preconditioning, polar conversion, one radius, and highly concentrated angles near $45^\circ$.

KV storage at a glance

Legacy stacks FP16 metadata per block; PolarQuant keeps r and angles.

Block quant

Each block still needs extra numbers to decode the short codes, so overhead remains even when values look compressed.
[Diagram labels: KV → INT4 + FP16 meta, × N blocks — metadata overhead ↑]

PolarQuant

After random preconditioning, the method moves to polar coordinates and quantizes concentrated angles instead of storing normalization metadata.
[Diagram labels: KV → S → r + θ codebook — footprint ↓]

How to read the diagram labels

- FP16: Half-precision floating point (16 bits per number). About half the footprint of FP32 for the same count of values; slightly coarser grid of representable numbers.
- Quantization: Rounding real values onto a small set of integer codes to save bits. At use time you dequantize; you often need per-block helper numbers to map codes back to the right range.
- KV: A chunk of cached Key/Value vectors for past tokens (attention memory).
- INT4: Values packed into 4-bit integers—small, but not usable without extra info.
- +meta / FP16: High-precision helper numbers (scale, zero-point, etc.) needed to dequantize; stored separately.
- × N: Roughly: that metadata repeats for every block, so cost grows with N.
- S: Random matrix that mixes coordinates (preconditioning) before the polar transform.
- r: Radius: overall magnitude in polar form.
- θ: Angle (direction). Often stored as a codebook index instead of a full float.
- codebook: A small table of typical angles—like a palette—so you only store an index.
PolarQuant is elegant because it changes the coordinate system of the problem. Instead of forcing raw coordinates into low bits and paying normalization overhead, it stores one radius and a set of structured angles. That makes it especially attractive when KV-cache memory, not model size, is the true serving bottleneck.