SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
A few outlier channels wreck 8-bit quantization. Scale the problem into the weights, which quantize cleanly.
A handful of activation channels carry values a hundred times larger than everything else, and those outliers break 8-bit quantization. A rescale that leaves the layer's output identical moves the difficulty onto the weights, which quantize cleanly to 8 bits, so the matmul runs in integer arithmetic.
Run a 175-billion-parameter model in 8-bit integers and it collapses to gibberish. Eight bits is plenty; the trouble is that a few activation values run a hundred times larger than the rest, and no single ruler can measure both the giants and the crowd.
A large language model is expensive mostly because it is large. GPT-3 has 175 billion parameters, which is 350 GB just to hold the weights in FP16 (16-bit floating point, the usual serving format), enough to need eight 48 GB GPUs or five 80 GB A100s before you have run a single token. That memory, and the bandwidth to move it, is the binding cost of serving one of these models.
The standard lever is quantization: store numbers in fewer bits. Going from 16-bit floats to 8-bit integers (INT8) halves the bytes, which halves the memory and roughly doubles the memory bandwidth you get per second. On an NVIDIA A100 the integer matmul units also run at twice the rate of the float ones (624 trillion INT8 ops per second against 312 trillion FP16, both dense, no sparsity tricks). So INT8 promises about half the memory and about twice the throughput. For a model whose whole problem is size, that is the difference between eight GPUs and four.
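The memory arithmetic above is easy to check by hand. The sketch below is a back-of-the-envelope calculation only: the helper names are made up for illustration, 1 GB is taken as 10^9 bytes, and it counts weights alone, ignoring activations and the KV cache.

```python
import math

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Gigabytes needed to hold the weights alone (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def gpus_needed(memory_gb: float, gpu_capacity_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_capacity_gb)

fp16_gb = weight_memory_gb(175e9, 16)  # GPT-3-sized model in FP16
int8_gb = weight_memory_gb(175e9, 8)   # the same weights in INT8

print(f"FP16: {fp16_gb:.0f} GB -> {gpus_needed(fp16_gb, 48)} x 48 GB or {gpus_needed(fp16_gb, 80)} x 80 GB")
print(f"INT8: {int8_gb:.0f} GB -> {gpus_needed(int8_gb, 48)} x 48 GB")
```

Under these assumptions the FP16 weights need eight 48 GB cards or five 80 GB cards, and the INT8 weights fit on four 48 GB cards, matching the counts in the text.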
There is a condition on the throughput half of that promise, and it drives the entire paper: you only get the integer matmul if both inputs to it are integers. The weights and the activations both have to be INT8. Quantizing just the weights is easy and everyone does it, but it buys memory, not speed. This paper closes the gap between what quantizes easily and what actually runs fast. Its method, SmoothQuant, is training-free (no fine-tuning, no gradient steps, just a rewrite you apply once), keeps every matmul in INT8, and loses almost no accuracy on models up to 530 billion parameters.
A few ideas explain it: how INT8 quantization actually works and why a single outlier ruins it, why the obvious per-channel fix is illegal on the hardware, the exact rescale that moves the problem out of the activations, and how far to move it. None is hard on its own.
Why 8-bit needs both sides
Start with the throughput condition, because it is the constraint everything else bends around. A linear layer, the workhorse inside every Transformer block, computes $Y = XW$: a matrix of activations $X$ times a weight matrix $W$. Write the shapes down, because the axes matter later:
$$Y = XW, \qquad X \in \mathbb{R}^{T \times C_i}, \quad W \in \mathbb{R}^{C_i \times C_o}, \quad Y \in \mathbb{R}^{T \times C_o}$$
Here $T$ is the number of tokens in the batch, $C_i$ the input channels (the width of each token vector coming in), and $C_o$ the output channels, so the result is $T \times C_o$. Each output entry is a sum over the shared axis, $Y_{tk} = \sum_{j=1}^{C_i} X_{tj} W_{jk}$. That inner summation axis is where the difficulty will concentrate.
There are two ways to quantize this layer. In W8A16 you store the weights in INT8 but leave the activations in FP16. That halves the weight memory, which genuinely helps when you are decoding one token at a time and the run is bottlenecked on reading weights from memory. But it does nothing for the matmul itself, because the INT8 weights get converted back to FP16 before the multiply happens, and the multiply then runs on the FP16 units at the FP16 rate, so you save memory but not compute.
In W8A8 both operands are INT8, so the multiply runs on the integer units and you get the 2× arithmetic. Both halves of the promise, but only if the activations survive being squeezed into 8 bits. Weights do; activations, in a large model, do not. Understanding why takes one look at what an 8-bit quantizer actually is.
How INT8 quantization works
Quantization lays down a ruler with evenly spaced ticks and snaps every real number to the nearest tick. For symmetric 8-bit integers the ruler is centered at zero and runs from $-127$ to $127$ (the paper uses this restricted range, dropping $-128$ so the codes are symmetric about zero). The only knob is the tick spacing, the step size $\Delta$:
$$\bar{X}^{\text{INT8}} = \left\lceil \frac{X^{\text{FP16}}}{\Delta} \right\rfloor$$
Read it left to right: divide every value by the step, round to the nearest integer (that is the bracket), and you have a code in $[-127, 127]$. The step itself is set so that the largest-magnitude value in the tensor lands exactly on the last tick: $\Delta = \max(|X|) / 127$. You could instead clip the outlier down to a smaller maximum and give the crowd finer ticks, but those outlier channels carry information the model relies on, so corrupting them costs accuracy. Keeping them ties the tick spacing to a single number: that one maximum fixes the step for every other value.
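The ruler described above fits in a few lines. This is a minimal NumPy sketch of per-tensor symmetric INT8 quantization with the restricted range of -127 to 127; the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

Q_MAX = 127  # restricted symmetric INT8 range: codes in [-127, 127]

def quantize_symmetric(x: np.ndarray):
    """Per-tensor symmetric quantization: one step size for the whole tensor."""
    delta = float(np.abs(x).max()) / Q_MAX            # the largest value lands on the last tick
    codes = np.clip(np.round(x / delta), -Q_MAX, Q_MAX).astype(np.int8)
    return codes, delta

def dequantize(codes: np.ndarray, delta: float) -> np.ndarray:
    """Map integer codes back to real values on the ruler."""
    return codes.astype(np.float32) * delta

x = np.array([0.02, -0.5, 0.31, 1.27], dtype=np.float32)
codes, delta = quantize_symmetric(x)
x_hat = dequantize(codes, delta)

print(codes)                      # codes -> [2, -50, 31, 127]
print(np.abs(x - x_hat).max())    # rounding error, at most half a step
```

Because the step is tied to the maximum, changing only the largest entry of `x` changes the code of every other entry.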
Two details matter before moving on. First, the ruler can be measured once offline from a few calibration batches (static quantization, cheapest at runtime) or recomputed from each incoming tensor (dynamic, more accurate, slightly slower). Second, you can share one ruler across the entire matrix (per-tensor), or give each token its own (per-token), or each output channel its own (per-channel). The finer the grain, the more faithful the quantization, but also the more scale factors you store, and, for the wrong axis, a scale the integer matmul cannot use at all.
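To make the granularities concrete, the sketch below computes the step sizes for each scheme on small random tensors. It is only an illustration under assumed shapes (4 tokens, 6 input channels, 3 output channels); the variable names are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))   # activations: [tokens T, input channels C_i]
W = rng.normal(size=(6, 3))   # weights: [C_i, output channels C_o]
Q_MAX = 127

# One ruler for the whole activation matrix.
delta_per_tensor = np.abs(X).max() / Q_MAX                             # scalar
# One ruler per token (row of X).
delta_per_token = np.abs(X).max(axis=1, keepdims=True) / Q_MAX         # shape [T, 1]
# One ruler per output channel (column of W).
delta_per_out_channel = np.abs(W).max(axis=0, keepdims=True) / Q_MAX   # shape [1, C_o]
# One ruler per input channel of X: the "wrong axis" for the integer matmul.
delta_per_in_channel = np.abs(X).max(axis=0, keepdims=True) / Q_MAX    # shape [1, C_i]

print(delta_per_token.shape, delta_per_out_channel.shape, delta_per_in_channel.shape)
```

The finer schemes never use a coarser step than the per-tensor one: every per-token step is at most `delta_per_tensor`, which is why they are more faithful.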
A few channels blow up
One fact makes large-model quantization hard, and it is easier to see than to take on faith. Plot the activations going into a linear layer of a big model next to that layer's weights. The weights are flat and mild, a gentle spread easily covered by 8 bits. The activations are not. A few channels carry values that tower over everything else, and they are the same few channels in every token.
Three properties, all visible in such a plot, define these activation outliers. They are large, on the order of a hundred times the typical activation value. They live in a small number of channels, a fraction of a percent of the width. And they are persistent: if a channel is an outlier, it is an outlier in every token and every sequence, so its column stays bright all the way down. It is like a choir where the same few singers, in the same seats, drown out everyone else in every song.
This is not a quirk of one model. Dettmers and colleagues, in the LLM.int8() paper, traced these outliers appearing as a phase transition with scale: below a few billion parameters they are mild, and by around 6.7 billion they have taken over, showing up in essentially every layer at once. It is exactly the regime, models with tens or hundreds of billions of parameters, where you most want quantization and where naive quantization most completely fails. One number to hold onto: OPT-175B scores 66.9% average accuracy in FP16, and naive INT8 drops it to 35.5%, which for these zero-shot tasks is no better than random guessing.
A note on the "hundred times" figure, since two numbers float around and it is easy to think they disagree. SmoothQuant's ~100× is the outlier measured against the typical activation. LLM.int8() quotes a smaller ~3-20×, measured against the largest other channel. Both are right; they just pick different baselines, and either way the outlier dwarfs the crowd, which is all the next step needs.
How one outlier wastes the bits
Now connect the outlier to the ruler. Per-tensor quantization sets one step size from the global maximum, and the global maximum is the outlier. So the ruler stretches to reach a value a hundred times out, its ticks spread a hundred times apart, and the ordinary values, which all live near zero, are left sharing only the few ticks nearest the center.
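A quick way to see the stretched ruler is to count how many distinct codes a set of ordinary values receives. The sketch below is a toy setup, not data from the paper: 201 evenly spaced values in [-1, 1], quantized once with a ruler that ends at 1 and once with a ruler that ends at an outlier of 100.

```python
import numpy as np

Q_MAX = 127
ordinary = np.linspace(-1.0, 1.0, 201)   # the "crowd": values near zero

def distinct_codes(values: np.ndarray, tensor_max: float) -> int:
    """Number of different integer codes the values land on."""
    delta = tensor_max / Q_MAX           # step set by the tensor's global maximum
    return len(np.unique(np.round(values / delta)))

alone = distinct_codes(ordinary, tensor_max=1.0)           # ruler ends at the crowd's own max
with_outlier = distinct_codes(ordinary, tensor_max=100.0)  # ruler stretched to the outlier

print(alone, with_outlier)   # 201 3
```

With the outlier present, every ordinary value rounds to -1, 0, or 1, which is the collapse the paragraph describes.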
The paper puts a number on the damage. For an ordinary channel with maximum magnitude $m_j$ inside a tensor whose global maximum is $m$, the number of the 256 levels that channel can actually reach is:
$$\text{effective levels} = 2^8 \cdot \frac{m_j}{m}$$
When the outlier makes $m$ about a hundred times an ordinary channel's $m_j$, that is roughly $256 / 100 \approx 2.6$ levels. Note the word: levels, not bits. Bits are just the base-2 logarithm of the number of distinguishable levels, so two or three levels is a little over one bit of real precision ($\log_2 2.56 \approx 1.4$), out of a nominal eight. You paid for INT8 and, for most of the tensor, you are getting worse than INT2. As the outlier grows, an ordinary channel's values collapse onto two or three grid lines, rounding into each other until they are indistinguishable.
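The levels-versus-bits distinction is a two-line calculation. The helper below is a hypothetical name for the paper's effective-levels count, 2^8 times the ratio of the channel maximum to the tensor maximum.

```python
import math

def effective_levels(channel_max: float, tensor_max: float, n_bits: int = 8) -> float:
    """Quantization levels a channel can reach when the step is set by tensor_max."""
    return 2 ** n_bits * channel_max / tensor_max

levels = effective_levels(channel_max=1.0, tensor_max=100.0)  # outlier 100x the ordinary channel
bits = math.log2(levels)                                      # levels expressed as bits of precision

print(f"{levels:.2f} levels, {bits:.2f} bits")   # 2.56 levels, 1.36 bits
```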
The bits are all being spent on the empty gap between the crowd and the outlier. The obvious repair is to stop making everyone share one ruler.
The fix the GEMM blocks
If a few channels are the problem, give each channel its own ruler. That is per-channel quantization, and for the outlier channel it is exactly right: measure it against its own maximum, and its ordinary neighbors keep their precision. Per-channel quantization of the weights is standard and completely fine. The activations are where it breaks down, and not because the feature is missing: the obstacle lives in the shape of the matmul.
Recall that each output is a sum over the input-channel axis, $Y_{tk} = \sum_j X_{tj} W_{jk}$. After an INT8 matmul you have to undo the quantization, multiplying by the step sizes to get back to real numbers. A factor that does not depend on the summation index is a constant as far as the sum is concerned, so it slides straight out in front. A per-token activation scale does not depend on $j$, so it factors out to the left. A per-output-channel weight scale does not depend on $j$ either, so it factors out to the right. That is exactly the dequantization the fast kernels support:
$$Y = \text{diag}(\Delta_X^{\text{FP16}}) \cdot \left(\bar{X}^{\text{INT8}} \cdot \bar{W}^{\text{INT8}}\right) \cdot \text{diag}(\Delta_W^{\text{FP16}})$$
A per-input-channel activation scale will not factor, because $j$ is the summation index itself. It rides inside the sum, $\sum_j \Delta_j \bar{X}_{tj} \bar{W}_{jk}$, with a different value on every term, and there is no way to pull it out past the accumulation. The integer matmul accumulates in one pass and applies its scales once at the end; a per-input-channel scale would demand a separate floating-point multiply on every term before adding, which is precisely the work the integer GEMM was supposed to avoid. The barrier is the axis, not the granularity: what matters for any scheme is where its scale lands relative to the sum.
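The factoring argument can be checked numerically. In this sketch the integer codes and the step sizes are random stand-ins (assumptions for illustration), and INT8 codes are held in int32 arrays so the accumulation does not overflow.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_i, C_o = 4, 8, 3
X_q = rng.integers(-127, 128, size=(T, C_i)).astype(np.int32)    # activation codes
W_q = rng.integers(-127, 128, size=(C_i, C_o)).astype(np.int32)  # weight codes

d_token = rng.uniform(0.01, 0.1, size=(T, 1))   # per-token activation steps
d_out = rng.uniform(0.01, 0.1, size=(1, C_o))   # per-output-channel weight steps
d_in = rng.uniform(0.01, 0.1, size=(1, C_i))    # per-input-channel activation steps

# Outer-axis scales: dequantize first, or run the integer matmul and scale once at the end.
reference = (d_token * X_q) @ (W_q * d_out)
factored = d_token * (X_q @ W_q) * d_out
assert np.allclose(reference, factored)

# Inner-axis scale: it must be applied to every term before the sum,
# so the left operand is no longer an integer matrix.
scaled_operand = X_q * d_in
inner = scaled_operand @ W_q
print((X_q @ W_q).dtype, scaled_operand.dtype)   # integer accumulation vs. float operand
```

The first check passes because both scales are constant along the summed axis; the second product can only be formed after the activations have left the integer domain.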
Every prior method got stuck in this corner. You can have the precision of per-channel activation scaling, or you can have the speed of the integer matmul, but not both, because the scale that would tame the outlier rides the very axis being summed over. LLM.int8() took the precision and paid for it: it detects the outlier channels at runtime, peels them off into a separate FP16 matmul, and runs the rest in INT8, then adds the two back together. That keeps the accuracy (it is nearly lossless), but the outlier columns are scattered and irregular, the extra floating-point matmul plus the gather and scatter around it are expensive, and in SmoothQuant's own benchmarks the result usually runs slower than plain FP16, recovering the accuracy but losing the speedup it was meant to deliver.
SmoothQuant avoids the trade. It gets per-channel-quality quantization of the activations without ever needing a per-input-channel activation scale at runtime. The way through is to notice that the difficulty does not have to stay on the activations at all.
Migrate the difficulty
The activations are hard to quantize and the weights are easy, and they are multiplied together. So move some of the hardness across the multiply. Insert a per-input-channel factor $s \in \mathbb{R}^{C_i}$, divide the activations by it, and multiply the weights by it. On the shared channel axis, dividing then multiplying by the same numbers cancels, so the layer computes the identical output:
$$Y = \left(X \, \text{diag}(s)^{-1}\right) \cdot \left(\text{diag}(s) \, W\right) = \hat{X} \hat{W}$$
The $\text{diag}(s)^{-1} \, \text{diag}(s)$ in the middle is the identity; it sits at the seam between $X$ and $W$ and does nothing to the product. It is the algebra of multiplying and dividing a fraction by the same number, $\frac{x}{s} \cdot (s \, w) = x \, w$, applied one channel at a time. In components, $\hat{X}_{tj} = X_{tj} / s_j$ shrinks the activation in channel $j$, and $\hat{W}_{jk} = s_j W_{jk}$ grows the corresponding weight row to compensate. Pick a large $s_j$ for an outlier channel and its activation spike shrinks while the flat weight there absorbs the difference.
One thing this transform is not is an approximation. Mathematically, $\hat{X}\hat{W}$ equals $XW$ exactly; in floating point the two agree up to ordinary rounding error. Nothing is lost by rescaling. The only error that matters in SmoothQuant is the INT8 rounding that comes afterward, and the entire point of the rescale is to make that rounding easy: instead of one tensor with a hundred-fold spike, you now have two tensors that are each mild. The term for what just happened is migrating the quantization difficulty. The activation's outlier problem does not get solved so much as shared, pushed partway onto a weight matrix that had precision to spare.
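The equivalence is easy to confirm on random data. The sketch below uses an arbitrary positive vector for the per-channel factor (any positive values work for this check), with one synthetic outlier channel planted in the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # activations [tokens, C_i]
X[:, 3] *= 100.0                         # plant one outlier channel
W = rng.normal(size=(8, 4)) * 0.1        # weights [C_i, C_o]
s = rng.uniform(0.5, 20.0, size=8)       # arbitrary positive per-input-channel factors

X_hat = X / s                # divides column j of X by s_j
W_hat = s[:, None] * W       # multiplies row j of W by s_j

max_diff = np.abs(X @ W - X_hat @ W_hat).max()
assert np.allclose(X @ W, X_hat @ W_hat)
print(max_diff)              # floating-point noise only
```

Only the choice of the factor, not the identity itself, determines how well the two rescaled tensors quantize.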
Picture the migration with one bar per channel, activations on top and weights below. Raise the migration strength and the outlier channel's spike slides out of the activation and into the weight, the two meeting balanced in the middle.
Both extremes fail, for the same reason the middle works. At migration strength $\alpha = 0$ every channel keeps its activation spike and you are back to the original broken problem; at $\alpha = 1$ the spike lands entirely on the weights and their quantization collapses instead. Only a split in the middle tames both, and finding the right split is the last piece.
How much to migrate
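How much difficulty to move is a single dial, the migration strength $\alpha$, and the paper sets the smoothing factor from the per-channel maxima of both tensors, where $X_j$ is column $j$ of the activations and $W_j$ is row $j$ of the weights:
$$s_j = \frac{\max(|X_j|)^{\alpha}}{\max(|W_j|)^{1-\alpha}} \tag{4}$$
The exponents split a channel's difficulty on a log scale:
$$\log s_j = \alpha \log \max(|X_j|) - (1 - \alpha) \log \max(|W_j|)$$
To see what the balance does, push the formula through. After smoothing, the two channel maxima come out as:
$$\max(|\hat{X}_j|) = \frac{\max(|X_j|)}{s_j} = \big(\max(|X_j|) \, \max(|W_j|)\big)^{1-\alpha}, \qquad \max(|\hat{W}_j|) = s_j \max(|W_j|) = \big(\max(|X_j|) \, \max(|W_j|)\big)^{\alpha}$$
Those two are equal exactly when $1 - \alpha = \alpha$, so only at $\alpha = 0.5$ do the smoothed activation and weight share the same per-channel maximum: both land on $\sqrt{\max(|X_j|) \, \max(|W_j|)}$, the geometric mean of the original maxima. (A caution, because it is a natural slip: the geometric mean is the value of the smoothed maxima, not of $s_j$ itself. The factor is a ratio of maxima, $s_j = \sqrt{\max(|X_j|) / \max(|W_j|)}$ at $\alpha = 0.5$.)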
Make it concrete on the outlier channel. Say it has activation maximum 70 and weight maximum 0.34. At $\alpha = 0.5$, $s = \sqrt{70 / 0.34} \approx 14.3$. The activation drops from 70 to $70 / 14.3 \approx 4.9$, and the weight rises from 0.34 to $0.34 \times 14.3 \approx 4.9$. Both at 4.9. The per-tensor step now spans only to 4.9 instead of 70, so plugging the new ceiling back into the effective-levels count shows an ordinary channel with maximum near 1 now getting about $256 \times 1 / 4.9 \approx 52$ levels, up from the three or four ($256 / 70 \approx 3.7$) it had. Cutting the range from 70 to 4.9 gives the ordinary channels back the precision they had lost, and with it the model's accuracy.
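The worked numbers can be reproduced directly, along with the two extremes of the dial. The helper name below is hypothetical; the formula is the per-channel smoothing factor with the activation maximum raised to the migration strength and the weight maximum raised to one minus it.

```python
import math

def smooth_channel(act_max: float, wgt_max: float, alpha: float):
    """Return (s, smoothed activation max, smoothed weight max) for one channel."""
    s = act_max ** alpha / wgt_max ** (1 - alpha)
    return s, act_max / s, wgt_max * s

act_max, wgt_max = 70.0, 0.34            # the outlier channel from the text

for alpha in (0.0, 0.5, 1.0):
    s, a, w = smooth_channel(act_max, wgt_max, alpha)
    print(f"alpha={alpha}: s={s:.2f}, activation max={a:.2f}, weight max={w:.2f}")

s, a, w = smooth_channel(act_max, wgt_max, 0.5)
levels_before = 256 * 1.0 / act_max      # ordinary channel (max near 1) before smoothing
levels_after = 256 * 1.0 / a             # the same channel after the ceiling drops to a
print(f"levels: {levels_before:.1f} -> {levels_after:.1f}")
```

At 0.5 both maxima come out near 4.88, the geometric mean of 70 and 0.34; at 0 the activation keeps the larger maximum and at 1 the weight does.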
In practice $\alpha = 0.5$ is the setting for the OPT and BLOOM families, with a broad flat sweet spot around it (roughly 0.4 to 0.6) so the exact value is not delicate. GLM-130B is the interesting exception: with something like 30% of its channels acting as outliers, an even split still leaves the activations too spiky to quantize, so it pushes $\alpha$ up to 0.75 and shifts more of the load onto the weights. The dial is not one-size-fits-all, but it is one number found by a quick search on a few calibration batches, not a training run.
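A search over the dial might look like the sketch below. It is an illustration under stated assumptions, not the paper's procedure: the data is synthetic with one planted outlier channel, both tensors are fake-quantized per-tensor, and the score is the mean squared error of a single layer's output, whereas a real search would score the whole model on calibration data.

```python
import numpy as np

def fake_quant(x: np.ndarray, q_max: int = 127) -> np.ndarray:
    """Per-tensor symmetric quantize-then-dequantize."""
    delta = np.abs(x).max() / q_max
    return np.clip(np.round(x / delta), -q_max, q_max) * delta

def smoothed_error(X: np.ndarray, W: np.ndarray, alpha: float) -> float:
    """Output MSE of the W8A8 layer after smoothing with strength alpha."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    Y_q = fake_quant(X / s) @ fake_quant(s[:, None] * W)
    return float(np.mean((Y_q - X @ W) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))            # synthetic calibration activations
X[:, 5] *= 100.0                         # one outlier channel
W = rng.normal(size=(32, 16)) * 0.1

alphas = np.linspace(0.0, 1.0, 11)       # grid 0.0, 0.1, ..., 1.0
errors = [smoothed_error(X, W, a) for a in alphas]
best_alpha = float(alphas[int(np.argmin(errors))])
no_smoothing = float(np.mean((fake_quant(X) @ fake_quant(W) - X @ W) ** 2))

print(f"best alpha on this toy layer: {best_alpha:.1f}")
print(f"error without smoothing {no_smoothing:.4f}, with {min(errors):.6f}")
```

The value the search lands on for this toy layer says nothing about real models; the point is only that the search is a loop over one scalar with no gradients involved.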
Free at runtime
One worry remains. If the activations have to be divided by $s$ before every matmul, that is an extra per-channel scaling on the activations at runtime, which sounds like exactly the cost the whole scheme was avoiding. It is not, because the division never happens as its own step. The activation entering a linear layer is produced by the layer just before it, usually a LayerNorm, which already multiplies every output channel by a learned gain and adds a bias, one multiply-add per channel. Folding $1/s$ into that gain and bias changes the numbers they hold, not the number of operations that run, so the smoothed activations arrive pre-divided at no added cost. The weight side is even simpler: multiply the weights by $\text{diag}(s)$ once, offline, and store $\hat{W}$. At serving time there is no trace of $s$ left; the model runs exactly as before.
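The folding claim can be verified with a hand-written LayerNorm. This is a NumPy sketch with random gain, bias, and scale values standing in for a trained layer; `layer_norm` is a local helper, not a library call.

```python
import numpy as np

def layer_norm(h: np.ndarray, gain: np.ndarray, bias: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token vector, then apply the per-channel gain and bias."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps) * gain + bias

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))              # hidden states entering the LayerNorm
gain = rng.uniform(0.5, 2.0, size=8)
bias = rng.normal(size=8)
W = rng.normal(size=(8, 4))              # weights of the linear layer that follows
s = rng.uniform(0.5, 20.0, size=8)       # per-input-channel smoothing factors

Y_ref = layer_norm(h, gain, bias) @ W    # original layer pair

X_hat = layer_norm(h, gain / s, bias / s)   # 1/s folded in: same operation count
W_hat = s[:, None] * W                      # s folded into the weights, offline
Y_folded = X_hat @ W_hat

assert np.allclose(X_hat, layer_norm(h, gain, bias) / s)   # activations arrive pre-divided
assert np.allclose(Y_ref, Y_folded)                        # layer output unchanged
```

Both the gain and the bias have to be divided, since the bias is added after the gain and would otherwise arrive unscaled.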
Together, "training-free" and "offline" reduce the method to one calibration pass to measure the activation maxima, one formula for $s$, and one rewrite of the surrounding layers:
```python
import torch

@torch.no_grad()
def smooth_layer(ln, W, calibration_batches, alpha=0.5):
    # training-free: one calibration pass, then an offline rewrite.
    # each x: [tokens, C_i] activations, W: [C_i, C_o] weights, alpha in [0, 1]
    # ln: the preceding torch.nn.LayerNorm (with weight and bias)
    act_max = torch.zeros(W.shape[0], dtype=W.dtype, device=W.device)  # per-input-channel |activation| max
    for x in calibration_batches:                    # ~512 sentences, no labels needed
        act_max = torch.maximum(act_max, x.abs().amax(dim=0))
    wgt_max = W.abs().amax(dim=1)                    # per-input-channel |weight| max
    s = act_max.pow(alpha) / wgt_max.pow(1 - alpha)  # eq (4), a vector of length C_i
    s = s.clamp(min=1e-5)                            # guard against dividing by zero on dead channels
    W_hat = s[:, None] * W                           # same as diag(s) @ W: weights absorb the spread
    ln.weight.div_(s)                                # fold 1/s into the preceding LayerNorm
    ln.bias.div_(s)                                  # so activations arrive pre-smoothed, free
    return W_hat  # W_hat and the smoothed activations now both quantize cleanly to INT8
```
The one case with no preceding scale to fold into is a residual add, where the activation arrives from a shortcut connection that bypasses the layer. There SmoothQuant keeps an explicit scale on the residual branch, following the Gamma Migration trick from Wei and colleagues' Outlier Suppression. It is a small exception and does not change the picture: the transformation is fixed once, offline, and disappears into the weights.
What it buys
Line the methods up on OPT-175B and the split is stark. The three that quantize activations naively (plain W8A8, ZeroQuant, Outlier Suppression) all collapse to about 35% accuracy and perplexity in the tens of thousands, which is a model producing noise. LLM.int8() holds accuracy at 66.7% but, as covered above, gives back the speed. SmoothQuant, at all three of its aggressiveness levels, lands at 66.4 to 66.8%, against FP16's 66.9%, while staying in pure INT8. On perplexity the collapse is even clearer: the broken baselines blow past 80,000 while SmoothQuant sits at 11.1, next to FP16's 10.99.
The efficiency is the other half. In the PyTorch implementation SmoothQuant runs OPT models about 1.51× faster while using 1.96× less memory on a single A100, and the tuned FasterTransformer integration (NVIDIA's production inference library) reaches up to 1.56× with memory roughly halved. Those realized numbers sit just under the 2× ceiling that INT8 promises on paper, as you would expect once real kernels, overheads, and the parts that stay in FP16 are counted. The memory saving lands hardest: OPT-175B fits on four GPUs instead of eight, OPT-66B on one instead of two, and the 530-billion-parameter MT-NLG model, which the paper also quantizes with no accuracy loss, serves inside a single 8-GPU node, halving the machine you need to run the largest models made public. Tensor parallelism reaches single-node serving by splitting each layer's matrices across the GPUs of one machine; SmoothQuant reaches the same place from the other direction, by shrinking the numbers instead of splitting the layers.
The reach is broad because the method assumes almost nothing. Any matmul with an activation-times-weight structure and outliers concentrated in a few input channels is a candidate, which is why the same recipe covers OPT, BLOOM, GLM, MT-NLG, and the Llama and Mistral and Mixtral families. There is no architecture it was designed for and no training it needs. You measure a few maxima, choose one number, and rewrite two layers, and a model that could not survive 8 bits now runs on half the GPUs at nearly twice the matmul throughput, with its accuracy intact. The outliers were never the real problem; forcing one tensor to absorb all of them was.
Questions you might still have
Why not just quantize the weights and leave activations in FP16?
Weight-only INT8 (W8A16) dequantizes the weights back to FP16 before the matmul, so the multiply still runs on FP16 units at the FP16 rate. You save memory and bandwidth, which helps single-token decoding, but you get no arithmetic speedup. Only W8A8, both operands in INT8, runs the integer matmul and gets the ~2× throughput. That is why the activations have to be quantized too.
If the rescale is exactly equivalent, why does accuracy drop at all?
The rescale itself loses nothing: the diagonal factors cancel, so the layer output is mathematically identical, and in floating point it matches up to ordinary rounding error. Essentially all of the error comes from the INT8 rounding that happens afterward. Smoothing does not remove that rounding; it makes it easy, by handing the quantizer two mild tensors instead of one with a hundred-fold spike.
Does SmoothQuant get rid of the outliers?
No, it relocates them. The outlier magnitude is rescaled out of the activation and (shrunk by the balancing) onto the weight. It does not vanish; it stops being concentrated in the one tensor that could not absorb it. At the balance point the spike is spread so that neither tensor has a maximum far above its neighbors.
How is this different from LLM.int8(), or from GPTQ and AWQ?
LLM.int8() keeps the outlier channels in FP16 and the rest in INT8, a mixed-precision split that is accurate but whose scattered FP16 matmul is slow. SmoothQuant keeps everything in INT8. GPTQ and AWQ are weight-only methods (W-only, no activation quantization), so they help memory but not the matmul. For the library’s other take on compressing numbers, see the TurboQuant explainer, which tackles the different problem of squeezing individual vectors to a few bits.
Where does alpha = 0.5 come from, and when do you change it?
0.5 is the balance point where the smoothed activation and weight share the same per-channel maximum, the geometric mean of their originals. It works for OPT and BLOOM with a wide sweet spot around it. GLM-130B has harder activations, about 30% of channels acting as outliers, so it migrates more onto the weights with alpha = 0.75. The value is found by a quick grid search on calibration data, not by training.
Footnotes & further reading
- The paper: Xiao, Lin, Seznec, Wu, Demouth, Han, SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (MIT / NVIDIA, ICML 2023).
- Emergent activation outliers and the mixed-precision baseline: Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022).
- The per-channel scale migration and the residual-branch trick SmoothQuant reuses: Wei et al., Outlier Suppression (2022).
- Integer-arithmetic quantization: Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2018); the symmetric conventions are surveyed in Gholami et al., A Survey of Quantization Methods (2021).
- For another angle on compressing model numbers, the library's TurboQuant explainer covers near-optimal low-bit compression of individual vectors, a different problem from SmoothQuant's per-channel rebalancing.