Theory · Quantization

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

A random rotation turns any vector into the same easy problem.

Compressing a high-dimensional vector to a couple of bits per number sounds like it should need a model of your data. TurboQuant uses none. One random rotation makes every input look the same, and a fixed per-coordinate quantizer lands within a small constant of the information-theoretic best.

Explaining the paperTurboQuant: Online Vector Quantization with Near-optimal Distortion RateZandieh, Daliri, Hadian, Mirrokni · Google Research · DeepMind · NYU · arXiv:2504.19874 ↗

The same rotate-and-look-up applies the instant a vector arrives, so it keeps pace with a KV cache that grows every token or a vector database with a billion entries, no codebook to train and no calibration pass.

Modern AI runs on vectors. A language model's key/value cache is a pile of them, one per token per layer, growing without bound as the context gets longer. A vector database is millions or billions of them, sitting in memory waiting to be searched. Every one of those vectors is a list of floating-point numbers, sixteen or thirty-two bits each, and the bill comes due twice: in the memory to store them and in the time to move them between fast and slow memory on an accelerator. You can spend fewer bits per number, but throwing away bits throws away accuracy, and you kept those vectors precisely because their geometry, their lengths and the angles between them, carried the answer.

TurboQuant, from Google Research, DeepMind, and NYU, spends those bits close to the information-theoretic limit. It compresses each coordinate to a few bits while provably staying within a small constant factor of the least distortion any scheme could achieve, at every bit budget. And it does this online: no training pass, no calibration on your data, a fixed recipe you apply the instant a vector arrives. The argument is short and mostly geometric, built from four ideas: what we are trying to minimize, why a random rotation makes the hard case disappear, how to quantize one number optimally, and why inner products need a second, separate stage.

Storing a vector in a few bits

We have a $d$ -dimensional vector $\mathbf{x}$ and a budget of $b$ bits per coordinate, the bit-width. A quantizer is a pair of maps: an encoder $Q$ that turns $\mathbf{x}$ into a short binary code, and a decoder $Q^{-1}$ that turns the code back into a reconstructed vector $\tilde{\mathbf{x}}$ . The reconstruction is never exact (you spent fewer bits than the original floats held), so the only question is how much you lose. The paper tracks two different losses, because the two things people do with these vectors care about different errors.

The first is plain reconstruction error, the mean-squared error (MSE): how far the reconstructed vector sits from the original, squared and averaged over the quantizer's own randomness.

D_{\tt mse} := \mathop{\mathbb{E}}_{Q}\Big[\,\big\lVert \mathbf{x} - Q^{-1}(Q(\mathbf{x})) \big\rVert_2^2\,\Big]

(1)

The second is about inner products. Most of what these vectors are for is dot products: attention scores a query against every key by their inner product, and a similarity search ranks database vectors by their inner product with the query. So what you really want preserved is $\langle \mathbf{y}, \mathbf{x}\rangle$ for a query $\mathbf{y}$ , even after $\mathbf{x}$ has been crushed to a few bits:

D_{\tt prod} := \mathop{\mathbb{E}}_{Q}\Big[\,\big|\langle \mathbf{y}, \mathbf{x}\rangle - \langle \mathbf{y}, Q^{-1}(Q(\mathbf{x}))\rangle\big|^2\,\Big]

(2)

And for inner products there is a second, sharper demand than being close. The estimate should be unbiased: averaged over the quantizer's randomness, it should equal the truth exactly, with no systematic tilt.

\mathop{\mathbb{E}}_{Q}\big[\langle \mathbf{y}, Q^{-1}(Q(\mathbf{x}))\rangle\big] = \langle \mathbf{y}, \mathbf{x}\rangle

Unbiasedness matters when you sum or average many estimates, which is exactly what a search or an attention head does. A small random error in each term washes out across the sum; a small bias in each term, all pointing the same way, accumulates instead of cancelling, and shifts every score by the same amount. Because of that distinction, the paper needs two algorithms instead of one.

Two constraints define the problem. First, no assumptions about the data: the vector $\mathbf{x}$ is worst-case, an adversary's choice, not a draw from some friendly distribution. Second, and this is the constraint that rules out most prior work, the quantizer must be data-oblivious, also called online. It cannot study your dataset first. Why insist on that? Because the alternative, a data-dependent quantizer that learns a codebook from your vectors (the standard move, usually k-means), needs a slow preprocessing pass and falls apart the moment the data shifts, which for a streaming KV cache is every single token. A data-oblivious quantizer applies the same fixed map to whatever arrives, instantly, so it can ride along with generation. In exchange, it has to be good against the worst case, with no chance to adapt. TurboQuant meets that worst case with a single matrix multiply.

Rotate the worst case away

A worst-case vector causes a concrete problem. Suppose $\mathbf{x}$ is a spike: one coordinate near 1, the rest near 0. A per-coordinate quantizer that spends its bits assuming values are spread out will waste almost all of them on the zeros and badly mangle the one coordinate that mattered. Some other input might be spread evenly. There is no single fixed per-coordinate scheme that is good for both, because the inputs don't agree on what a typical coordinate looks like.

There is one step: multiply $\mathbf{x}$ by a random rotation $\boldsymbol{\Pi}$ before quantizing. A random rotation is an orthogonal matrix drawn uniformly (you build one by taking the QR decomposition of a matrix of i.i.d. Gaussians, exactly as the paper does). Rotation preserves all the geometry that matters, every length and every angle, so it changes nothing about the inner products we are trying to keep. What it changes is the coordinates. Whatever $\mathbf{x}$ looked like, the rotated vector $\boldsymbol{\Pi}\mathbf{x}$ is a uniformly random point on the sphere of the same radius, and a uniformly random point has no special coordinates at all. The spike is gone, smeared evenly across all $d$ directions. The rotation converts every worst-case input into the same average-case input, so one fixed quantizer can be near-optimal for all of them, and a data-oblivious scheme works.

And the resulting coordinates are not merely spread out vaguely; they follow an exact, known law. For a unit vector, each coordinate of $\boldsymbol{\Pi}\mathbf{x}$ has the density

f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}\,(1-x^2)^{(d-3)/2}, \qquad x \in [-1, 1]

(Lemma 1)

which is a (rescaled) Beta distribution, and the derivation is one line of geometry: a point at coordinate value $x$ on the unit sphere lives on a sub-sphere of radius $\sqrt{1-x^2}$ in one lower dimension, so the density is the surface area of that sub-sphere, which is where the $(1-x^2)^{(d-3)/2}$ comes from. In high dimensions this bell concentrates and converges to a Gaussian $\mathcal{N}(0, 1/d)$ : mean zero (the coordinates of a sphere point are symmetric about zero, so a coordinate is as likely positive as negative), variance $1/d$ (the $d$ coordinate-squares are symmetric and sum to 1, so each averages $1/d$ ).

Figure 1 shows it. For any input shape, a spike, a few spikes, a ramp, the coordinates after a real random rotation follow the same bell, no matter what went in, and raising the dimension sharpens that bell toward the Gaussian.

Figure 1 · one rotation, the same coordinates

dim d64

Any unit vector x, after a random rotation Πx, becomes a uniform point on the sphere. Change the input shape and the top stems change completely; the histogram of rotated coordinates does not move, it always follows the same bell

f_X

(which sharpens to

\mathcal{N}(0,1/d)

as d grows).

The next step leans on a subtler fact: in high dimensions, distinct coordinates of a sphere point become nearly independent, not merely spread the same way. That is a real theorem, the Diaconis-Freedman result that the first few coordinates of a high-dimensional sphere point look like independent Gaussians, though it is a high-dimensional approximation, tightest when you look at far fewer coordinates than the dimension. Independence allows the coordinates to be quantized one at a time and still be near-optimal overall.

Quantize one number at a time

Near-independence collapses a hard $d$ -dimensional problem into the same easy one-dimensional problem, repeated $d$ times. Each coordinate is a draw from the bell $f_X$ ; quantize it on its own and add up the errors. The design reduces to one question. Given the bell and a budget of $b$ bits, where do you put the $2^b$ reconstruction values, the code pointsor centroids, to minimize squared error?

This is a standard problem, solved by the Lloyd-Max quantizer, one-dimensional $k$ -means. Its solution obeys two conditions that lock together. Each value should map to its nearest code point, so the cells are intervals split at the midpoints between neighbouring centroids (a Voronoi partition of the line). And each code point should sit at the mean of its cell, because the mean is the value that minimizes squared error within the cell. Written as the optimization the paper solves once and stores:

\mathcal{C}(f_X, b) = \min_{c_1 \le \cdots \le c_{2^b}} \sum_{i=1}^{2^b} \int_{\frac{c_{i-1}+c_i}{2}}^{\frac{c_i+c_{i+1}}{2}} |x - c_i|^2\, f_X(x)\, dx

(4)

Because the bell is the same for every coordinate and every input, this is solved once, offline, for each bit-width $b$ , and the resulting code points are baked into a tiny lookup table. Quantizing is then two steps: rotate, then for each coordinate find the nearest table entry. No per-data work, and it vectorizes on a GPU, so it runs efficiently on an accelerator.

The simplest case is the one-bit quantizer: at $b=1$ you get a single bit per number: two code points, and by symmetry they sit at $\pm\sqrt{2/\pi}$ times the coordinate's scale (the average magnitude of a Gaussian is $\sqrt{2/\pi}$ ). One bit keeps only the sign of each rotated coordinate and a fixed magnitude. Even that recovers most of the vector: the per-vector MSE at one bit is $1 - 2/\pi \approx 0.36$ of the vector's squared length, so the reconstruction keeps the other 64%. As $b$ rises the cells multiply and the error falls.

Figure 2 · the optimal scalar quantizer

bits b24 levels · MSE 0.117/d

The coordinate density (teal bell). With

b

bits there are

2^b

code points placed by Lloyd-Max: each cell is split at the midpoints between neighbours, and each code point sits at its cell's mean. Drop a value anywhere; it snaps to the nearest code point and the amber span is the error. More bits, finer cells, smaller error.

# TurboQuant (MSE): quantize unit vector x to b bits / coordinate
Pi   = random_rotation(d)         # QR of a Gaussian matrix; fixed, shared
y    = Pi @ x                     # now every coordinate ~ N(0, 1/d)
idx  = nearest(y, codebook[b])    # Lloyd-Max table, precomputed once
# dequantize: look up the centroids, rotate back to the original basis
x_hat = Pi.T @ codebook[b][idx]   # idx is b bits per coordinate

That is the MSE quantizer end to end: a shared random rotation, a precomputed codebook, a nearest-neighbour lookup per coordinate, and the transpose rotation to decode. The open question is how close to optimal it is.

How close to optimal?

Adding up the per-coordinate errors gives the vector's total distortion, $D_{\tt mse} = d\cdot\mathcal{C}(f_X, b)$ , and the paper bounds it in closed form. For larger bit-widths the classical high-resolution formula (the fine-grained, many-bits regime where cells are small), due to Panter and Dite, for a Gaussian-shaped source gives the result:

D_{\tt mse} \le \frac{\sqrt{3}\,\pi}{2}\cdot\frac{1}{4^b} \approx \frac{2.72}{4^b}, \qquad \text{and at small } b:\ D_{\tt mse} \approx 0.36,\ 0.117,\ 0.03,\ 0.009

(Thm 1)

The distortion falls like $1/4^b$ , so every extra bit divides the error by four. The factor is four rather than two because one more bit halves the width of every cell, and squared error scales with width squared, so halving the width quarters the error. Four bits per number already drives the reconstruction error below 1% of the vector's squared length.

A small error is only part of the claim. The rest is that you cannot do meaningfully better with any quantizer. The Shannon Lower Bound, the rate-distortion floor from information theory, says that at $b$ bits per coordinate no scheme on this sphere can beat

D_{\tt mse} \ge \frac{1}{4^b}

(Thm 3)

TurboQuant reaches at most $\frac{\sqrt{3}\pi}{2}\cdot\frac{1}{4^b}$ ; nothing can beat $\frac{1}{4^b}$ . The gap is the constant $\sqrt{3}\pi/2 \approx 2.7$ , the same at every bit-width. So this fixed, training-free, one-rotation-and-a-lookup scheme is never worse than $2.7\times$ the unbeatable optimum, and at one bit, where the bound is sharpest, only $1.45\times$ . The figure plots both bounds on a log scale, where the constant gap is a fixed vertical offset between two parallel lines, and drops TurboQuant's actual distortion between them.

Figure 3 · near-optimal at every bit-width

bits b1gap 1.45× from optimal

Distortion against bit-width, log scale. The Shannon floor

1/4^b

is the best any quantizer can do; the dashed line is TurboQuant's guarantee,

2.72/4^b

. The amber dots are what it actually achieves. They hug the floor: the gap is

1.45\times

at one bit and never worse than

2.7\times

. Both bounds are straight with the same slope, because each bit divides the error by four.

This is the "near-optimal distortion rate" in the title, and it is the part that was open. Data-oblivious methods used to pay for their convenience with much looser bounds; TurboQuant gets the convenience and the near-optimal rate at once. Two caveats apply: the $1/4^b$ floor and the Gaussian-shaped bound are both high-dimensional limits (at finite $d$ the coordinate bell is exactly Beta, only approximately Gaussian), so "within $2.7\times$ " is an asymptotic statement, not a finite-size equality. It matches experiment closely all the same.

Inner products need a second stage

Everything so far minimizes reconstruction error. A vector that is close in MSE does give close inner products, but not unbiased ones, and bias is the error that accumulates across a sum. The problem is clearest at one bit, where the MSE quantizer keeps only the sign: $Q_{\tt mse}(\mathbf{x}) = \mathrm{sign}(\boldsymbol{\Pi}\mathbf{x})$ and decodes to $\sqrt{2/(\pi d)}\,\boldsymbol{\Pi}^{\!\top}\mathrm{sign}(\boldsymbol{\Pi}\mathbf{x})$ . The expected inner product of that reconstruction with a query is:

\mathop{\mathbb{E}}\big[\langle \mathbf{y}, Q_{\tt mse}^{-1}(Q_{\tt mse}(\mathbf{x}))\rangle\big] = \frac{2}{\pi}\,\langle \mathbf{y}, \mathbf{x}\rangle \approx 0.64\,\langle \mathbf{y}, \mathbf{x}\rangle

The estimate is systematically too small, shrunk by $2/\pi$ . The shrinkage traces back to exactly the property MSE optimizes for: a Lloyd-Max code point is the mean of its cell, the conditional average of the values that land there, and an average is pulled in from the extremes toward the center. Replacing each coordinate by its cell-mean keeps the direction but systematically underestimates the magnitude. The shrinkage has an exact form at every bit-width. Because the optimal reconstruction is the mean of its cell, the quantization error is orthogonal to the code, so the expected product of a coordinate and its code equals the code's variance, which is the coordinate's variance minus the MSE. With unit-variance coordinates that makes the slope exactly $\alpha(b) = 1 - D_{\tt mse}(b)$ , and the bias is whatever the MSE threw away. At one bit $\alpha = 1 - 0.36 = 2/\pi$ ; by four bits $\alpha = 0.99$ and the bias is nearly gone. The figure shows the amber estimates lying along that shallow line, fanning further below the diagonal as the true inner product grows, which is the bias accumulating.

TurboQuant pays for the bias directly, with a separate one-bit tool built for inner products. That tool is QJL, the Quantized Johnson-Lindenstrauss transform (by two of the same authors). The Johnson-Lindenstrauss lemma is the classical fact that projecting vectors onto a small number of random directions approximately preserves their lengths and inner products; QJL pushes this to the limit by keeping only the sign of each projection. It sketches a vector, builds a tiny lossy summary that still answers an inner-product query, against random Gaussian directions, the rows of a matrix $\mathbf{S}$ , and keeps only the sign of each projection, $Q_{\tt qjl}(\mathbf{x}) = \mathrm{sign}(\mathbf{S}\mathbf{x})$ ; each row produces one sign bit, so the sketch costs one bit per direction. It then decodes with a precise scale:

Q_{\tt qjl}^{-1}(\mathbf{z}) = \frac{\sqrt{\pi/2}}{d}\,\mathbf{S}^{\!\top}\mathbf{z}, \qquad \mathop{\mathbb{E}}\big[\langle \mathbf{y}, Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{x}))\rangle\big] = \langle \mathbf{y}, \mathbf{x}\rangle

That $\sqrt{\pi/2}$ is the exact inverse of the $\sqrt{2/\pi}$ shrinkage a sign introduces, so it cancels the bias and leaves an unbiased estimator. QJL is unbiased but noisy (a single bit per direction carries little information), with variance $\frac{\pi}{2d}\lVert\mathbf{y}\rVert^2$ . Using QJL alone is unbiased but high-variance; using MSE alone is low-variance but biased. TurboQuant's inner-product quantizer combines them in two stages.

Stage one: run the MSE quantizer, but at $b-1$ bits, saving one bit from the budget. This captures the bulk of $\mathbf{x}$ cheaply and accurately, and what it misses is a small residual $\mathbf{r} = \mathbf{x} - Q_{\tt mse}^{-1}(Q_{\tt mse}(\mathbf{x}))$ , a vector with small norm. Stage two: spend the last bit applying QJL to that residual. The estimate adds the two pieces:

\langle \mathbf{y}, \tilde{\mathbf{x}}_{\tt mse}\rangle + \lVert\mathbf{r}\rVert\cdot\langle \mathbf{y}, Q_{\tt qjl}^{-1}(Q_{\tt qjl}(\mathbf{r}))\rangle

The sum is unbiased because the second term's expectation is exactly the piece the first term dropped. QJL is unbiased on the residual, so on average the second term equals $\langle\mathbf{y}, \mathbf{r}\rangle$ , the missing piece, and the two add back to the whole:

\langle\mathbf{y}, \tilde{\mathbf{x}}_{\tt mse}\rangle + \langle\mathbf{y}, \mathbf{r}\rangle = \langle\mathbf{y}, \tilde{\mathbf{x}}_{\tt mse} + \mathbf{r}\rangle = \langle\mathbf{y}, \mathbf{x}\rangle

The biased main term plus the unbiased correction of its own leftover is unbiased. And because QJL's variance scales with the norm of what it sketches, and the residual's norm is small, the noise it adds is small too: the inner-product distortion comes out to $D_{\tt prod} = \frac{\pi}{2d}\lVert\mathbf{y}\rVert^2\,D_{\tt mse}(b-1)$ , which, for a unit-norm query, lands at $1.57/d,\, 0.56/d,\, 0.18/d,\, 0.047/d$ for $b=1,2,3,4$ and again sits within a small constant of the inner-product lower bound.

Figure 4 · biased MSE vs unbiased two-stage

bits b1

Estimated inner product against the true one. The dashed diagonal is unbiased. The MSE quantizer shrinks every estimate by

\alpha(b)=1-D_{\tt mse}(b)

, so its cloud lies along a shallower line and the gap grows with the true value (at one bit

\alpha=2/\pi

). The two-stage quantizer is centered on the diagonal at every value. Slide

b

and the bias shrinks toward zero.

# Inner-product TurboQuant: unbiased, b bits / coordinate
idx   = quant_mse(x, bits = b - 1)         # coarse code (b-1 bits)
r     = x - dequant_mse(idx)               # leftover residual, small norm
signs = sign(S @ r)                        # sign ignores scale: sketches r/||r||
store  = (idx, signs, norm(r))             # (b-1) + 1 bits + one scalar

# estimate <y, x> from the stored code:
est_mse = dot(y, dequant_mse(idx))
est_qjl = norm(r) * (sqrt(pi / 2) / d) * dot(y, S.T @ signs)
est     = est_mse + est_qjl                # unbiased: E[est] = <y, x>

The two algorithms share one core. For small reconstruction error, use the MSE quantizer; for unbiased inner products, spend one bit on a QJL correction to its residual and the bias is gone.

What it buys

The theory gives near-optimal distortion at any bit-width with no preprocessing. The experiments test it in the two places these vectors actually live.

The KV cache. A decoder model stores a key and value vector for every past token; at long context that cache dominates memory and the cost of moving it dominates latency. Quantizing it has to be online (tokens stream in one at a time) and has to preserve inner products (attention is a softmax over query-key dot products), both of which TurboQuant handles. On the needle-in-a-haystack test, hide one fact in a long document and ask the model to retrieve it, Llama-3.1-8B scores 0.997 out of a perfect 1.0 even with the cache squeezed to a quarter of its size ( $4\times$ , about 4 bits per number), matching the uncompressed model, while methods without distortion guarantees lose tokens and degrade recall. On the broader LongBench-E suite, a battery of long-context language tasks, quality survives down to roughly 2.5 bits per channel (a channel is one coordinate of the per-token vector), effectively lossless by 3.5 bits, using a small twist: a handful of high-magnitude "outlier" channels are kept at higher precision and the rest lower, averaging to the target budget (for a clean 2.5-bit head, 64 of the 128 channels at 3 bits and 64 at 2). This delivers long-context quality at a quarter to a sixth of the KV-cache memory.

Nearest-neighbour search. A vector database compresses its vectors with "product quantization" (PQ), which learns a k-means codebook over the dataset, and RaBitQ, a recent binary scheme with guarantees. Both spend real time building an index: PQ runs k-means, RaBitQ does heavier per-vector work with no GPU-friendly kernel. TurboQuant has no codebook to learn, so its indexing is a rotation and a lookup, and the timing gap is large. The figure shows quantization time on a log scale, because otherwise TurboQuant's bar would be invisible: tens of thousands of times faster than the baselines, and it still wins on recall.

Figure 5 · indexing time, essentially zero

dimension

Quantization time (paper Table 2, 4-bit, log scale). Product Quantization trains a codebook; RaBitQ does heavy per-vector work with no GPU kernel; TurboQuant rotates and looks up. Pick the dimension and read the speedup, often more than

100{,}000\times

, because there is no codebook to learn.

Once every input is a uniform point on the sphere, every coordinate obeys the same bell, and the optimal scalar quantizer for that bell is solved once and read from a table, landing within $2.7\times$ of the information-theoretic floor at every bit-width. The inner-product case needs the one extra repair: MSE-optimal codes shrink dot products by a fixed factor, and a one-bit QJL sketch of the leftover residual adds back exactly what was shrunk away. It needs no training, no calibration, and no codebook, relying only on the geometry the vectors already carry.

Provenance Verified against primary literature

TurboQuant (2025)The source paper: the rotation trick, the two quantizers, the bounds, the experiments.

QJL (2024)The 1-bit sign-sketch used on the residual; the √(π/2) debias factor and π/2 variance.

Lloyd-Max / Gray-NeuhoffOptimal scalar quantization; the √3π/2 Gaussian high-resolution constant.

Shannon / Cover-ThomasThe rate-distortion lower bound that pins the 1/4^b floor.

Diaconis-Freedman (1987)Near-independence of sphere coordinates (the paper cites Vershynin’s textbook).

correctionThe paper’s overview writes the coordinate limit as N(1, 1/d); it is N(0, 1/d) (mean 0 by symmetry, as Lemma 1 states). Its printed 2.5-bit channel split (32×3 + 96×2)/128 sums to 2.25, not 2.5; a clean 2.5-bit split is 64×3 + 64×2. And RaBitQ is CPU-vectorized, so the fair claim is that it lacks a GPU-friendly kernel, not vectorization. We teach the corrected versions.

Questions you might still have

Why rotate the vector at all?
A worst-case input concentrates its mass on a few coordinates, and no fixed per-coordinate quantizer is good for every such input. A random rotation turns any input into a uniform point on the sphere, so every coordinate follows the same bell. One fixed quantizer is then near-optimal for all of them, with no per-data tuning, and rotation preserves all lengths and angles so the geometry is untouched.

Why does the MSE-optimal quantizer get inner products wrong?
Its reconstruction is the mean of each cell, the conditional average of the values that land there, which pulls every value in toward the center. So it shrinks the vector by a factor 1 − MSE; at one bit that factor is 2/π ≈ 0.64, and every inner product comes out about 36% too small. The shrinkage is a systematic bias, not random noise, so it accumulates across a sum. The 1-bit QJL stage on the residual adds back exactly what was shrunk away.

What does "within 2.7× of optimal" actually mean?
The Shannon lower bound says no quantizer at b bits per coordinate can beat distortion 1/4^b on this sphere. TurboQuant provably reaches at most (√3π/2)·1/4^b ≈ 2.72/4^b, a fixed multiple of the floor at every bit-width, and only 1.45× at one bit. It is an asymptotic (high-dimensional) statement, and experiments match it closely.

Why is its indexing time basically zero?
There is no codebook to learn. Product Quantization runs k-means over your dataset; TurboQuant applies one random rotation (shared across all vectors) and a table lookup per coordinate, both data-oblivious. So quantizing a million vectors is milliseconds instead of minutes, so it works for a streaming KV cache.

Footnotes & further reading

The paper: Zandieh, Daliri, Hadian, Mirrokni, TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Google Research / DeepMind / NYU, 2025).
The one-bit inner-product tool used in stage two: Zandieh, Daliri, Han, QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead (2024).
Optimal scalar quantization and its high-resolution distortion: Lloyd and Max (1957-1960), surveyed in Gray & Neuhoff, Quantization (IEEE Trans. Information Theory, 1998), which gives the Panter-Dite Gaussian constant $\sqrt{3}\pi/2$ .
The rate-distortion floor (Shannon Lower Bound) is standard rate-distortion theory; see Cover & Thomas, Elements of Information Theory, Ch. 13.
That distinct coordinates of a high-dimensional sphere point are nearly independent Gaussians is the Diaconis-Freedman (1987) / Stam (1982) result; the paper cites Vershynin, High-Dimensional Probability (2018), which proves the single-coordinate limit and pairwise uncorrelatedness.
The nearest-neighbour baselines: Jégou, Douze, Schmid, Product Quantization for Nearest Neighbor Search (2011), and Gao & Long, RaBitQ (SIGMOD 2024).

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.