Verified · arXiv:2211.10438 · 22 min
Quantization · LLM inference

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

A few outlier channels wreck 8-bit quantization. Scale the problem into the weights, which quantize cleanly.

A handful of activation channels carry values a hundred times larger than everything else, and those outliers break 8-bit quantization. A rescale that leaves the layer's output identical moves the difficulty onto the weights, which quantize cleanly to 8 bits, so the matmul runs in integer arithmetic.

Explaining the paper: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. Xiao, Lin, Seznec, Wu, Demouth, Han · MIT · NVIDIA · ICML 2023 · arXiv:2211.10438

Run a 175-billion-parameter model in 8-bit integers and it collapses to gibberish. Eight bits is plenty; the trouble is that a few activation values run a hundred times larger than the rest, and no single ruler can measure both the giants and the crowd.

A large language model is expensive mostly because it is large. GPT-3 has 175 billion parameters, which is 350 GB just to hold the weights in FP16 (16-bit floating point, the usual serving format), enough to need eight 48 GB GPUs or five 80 GB A100s before you have run a single token. That memory, and the bandwidth to move it, is the binding cost of serving one of these models.

The standard lever is quantization: store numbers in fewer bits. Going from 16-bit floats to 8-bit integers (INT8) halves the bytes, which halves the memory and roughly doubles the number of values the same memory bandwidth delivers per second. On an NVIDIA A100 the integer matmul units also run at twice the rate of the float ones (624 trillion INT8 ops per second against 312 trillion FP16, both dense, no sparsity tricks). So INT8 promises about half the memory and about twice the throughput. For a model whose whole problem is size, that is the difference between eight GPUs and four.
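The memory arithmetic above is easy to reproduce. This is a minimal sketch that assumes decimal gigabytes and counts only the weights (no activations, KV cache, or framework overhead); the helper names `weight_gb` and `gpus_needed` are made up for illustration.

```python
import math

def weight_gb(n_params: float, bits: int) -> float:
    """Gigabytes (10**9 bytes) needed to store n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

def gpus_needed(total_gb: float, gpu_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds total_gb (weights only)."""
    return math.ceil(total_gb / gpu_gb)

fp16_gb = weight_gb(175e9, 16)   # a 175-billion-parameter model in FP16
int8_gb = weight_gb(175e9, 8)    # the same weights in INT8

print("FP16:", fp16_gb, "GB ->", gpus_needed(fp16_gb, 48), "x 48 GB or",
      gpus_needed(fp16_gb, 80), "x 80 GB")
print("INT8:", int8_gb, "GB ->", gpus_needed(int8_gb, 48), "x 48 GB or",
      gpus_needed(int8_gb, 80), "x 80 GB")
```

Real deployments need headroom beyond the weights, so these counts are lower bounds, not sizing advice.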

There is a condition on the throughput half of that promise, and it drives the entire paper: you only get the integer matmul if both inputs to it are integers. The weights and the activations both have to be INT8. Quantizing just the weights is easy and everyone does it, but it buys memory, not speed. This paper closes the gap between what quantizes easily and what actually runs fast. Its method, SmoothQuant, is training-free (no fine-tuning, no gradient steps, just a rewrite you apply once), keeps every matmul in INT8, and loses almost no accuracy on models up to 530 billion parameters.

A few ideas explain it: how INT8 quantization actually works and why a single outlier ruins it, why the obvious per-channel fix is illegal on the hardware, the exact rescale that moves the problem out of the activations, and how far to move it. None is hard on its own.

Why 8-bit needs both sides

Start with the throughput condition, because it is the constraint everything else bends around. A linear layer, the workhorse inside every Transformer block, computes $\mathbf{Y} = \mathbf{X}\mathbf{W}$: a matrix of activations $\mathbf{X}$ times a weight matrix $\mathbf{W}$. Write the shapes down, because the axes matter later:

$$\mathbf{Y} = \mathbf{X}\mathbf{W}, \quad \mathbf{X}\in\mathbb{R}^{T\times C_i},\ \ \mathbf{W}\in\mathbb{R}^{C_i\times C_o}$$

Here $T$ is the number of tokens in the batch, $C_i$ the input channels (the width of each token vector coming in), and $C_o$ the output channels, so the result $\mathbf{Y}$ is $T\times C_o$. Each output entry is a sum over the shared axis, $Y_{t,o} = \sum_{j=1}^{C_i} X_{t,j}\,W_{j,o}$. That inner summation axis $C_i$ is where the difficulty will concentrate.

There are two ways to quantize this layer. In W8A16 you store the weights in INT8 but leave the activations in FP16. That halves the weight memory, which genuinely helps when you are decoding one token at a time and the run is bottlenecked on reading weights from memory. But it does nothing for the matmul itself, because the INT8 weights get converted back to FP16 before the multiply happens, and the multiply then runs on the FP16 units at the FP16 rate, so you save memory but not compute.

In W8A8 both operands are INT8, so the multiply runs on the integer units and you get the 2× arithmetic. Both halves of the promise, but only if the activations survive being squeezed into 8 bits. Weights do; activations, in a large model, do not. Understanding why takes one look at what an 8-bit quantizer actually is.

How INT8 quantization works

Quantization lays down a ruler with evenly spaced ticks and snaps every real number to the nearest tick. For symmetric 8-bit integers the ruler is centered at zero and runs from $-127$ to $+127$ (the paper uses this restricted range, dropping $-128$ so the codes are symmetric about zero). The only knob is the tick spacing, the step size $\Delta$:

$$\bar{\mathbf{X}}^{\text{INT8}} = \left\lceil \frac{\mathbf{X}^{\text{FP16}}}{\Delta} \right\rfloor, \qquad \Delta = \frac{\max(|\mathbf{X}|)}{2^{N-1}-1} = \frac{\max(|\mathbf{X}|)}{127}\ \ (N=8) \tag{1}$$

Read it left to right: divide every value by the step, round to the nearest integer (that is the $\lceil\cdot\rfloor$ bracket), and you have a code in $[-127, 127]$. The step itself is set so that the largest-magnitude value in the tensor lands exactly on the last tick: $\Delta = \max(|\mathbf{X}|)/127$. You could instead clip the outlier down to a smaller maximum and give the crowd finer ticks, but those outlier channels carry information the model relies on, so corrupting them costs accuracy. Keeping them ties the tick spacing to a single number: that one maximum fixes the step for every other value.
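Equation (1) fits in a few lines. The sketch below is plain Python over a flat list, assuming the tensor is not all zeros; `quantize_int8` and `dequantize` are illustrative names, and Python's built-in `round` breaks exact ties toward the even integer, which may differ from a particular kernel's rounding mode.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 following eq. (1): returns (codes, step)."""
    step = max(abs(v) for v in values) / 127      # largest magnitude lands on the last tick
    codes = [round(v / step) for v in values]     # snap each value to the nearest tick
    return codes, step

def dequantize(codes, step):
    """Map integer codes back to real values on the ruler."""
    return [c * step for c in codes]

x = [0.5, -1.0, 0.25, 2.0]
codes, step = quantize_int8(x)
x_hat = dequantize(codes, step)

print(codes)                                      # integers in [-127, 127]
print([abs(a - b) for a, b in zip(x, x_hat)])     # each error is at most step / 2
```

The rounding error of any single value is bounded by half a step, which is why everything that follows comes down to keeping the step small.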

Two details matter before moving on. First, the ruler can be measured once offline from a few calibration batches (static quantization, cheapest at runtime) or recomputed from each incoming tensor (dynamic, more accurate, slightly slower). Second, you can share one ruler across the entire matrix (per-tensor), or give each token its own (per-token), or each output channel its own (per-channel). The finer the grain, the more faithful the quantization, but also the more scale factors you store, and, for the wrong axis, a scale the integer matmul cannot use at all.
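The three granularities differ only in which axis the maximum is taken over. A small sketch, with nested lists standing in for tensors; the function names and the toy numbers are assumptions for illustration.

```python
def step_per_tensor(M):
    """One step size shared by the whole matrix."""
    return max(abs(v) for row in M for v in row) / 127

def step_per_token(X):
    """One step size per row (token) of an activation matrix [T, C_i]."""
    return [max(abs(v) for v in row) / 127 for row in X]

def step_per_output_channel(W):
    """One step size per column (output channel) of a weight matrix [C_i, C_o]."""
    return [max(abs(row[o]) for row in W) / 127 for o in range(len(W[0]))]

X = [[0.2, -0.4, 60.0],
     [0.1,  0.3, 55.0]]       # 2 tokens, 3 input channels; channel 2 is large in every token
W = [[0.3, -0.1],
     [0.2,  0.4],
     [-0.1, 0.2]]             # 3 input channels, 2 output channels

print(step_per_tensor(X))            # a single number
print(step_per_token(X))             # one entry per token
print(step_per_output_channel(W))    # one entry per output channel
```

In this toy matrix every token row contains the large channel, so the per-token steps come out nearly as coarse as the per-tensor one.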

A few channels blow up

One fact makes large-model quantization hard, and it is easier to see than to take on faith. Plot the activations going into a linear layer of a big model next to that layer's weights. The weights are flat and mild, a gentle spread easily covered by 8 bits. The activations are not. A few channels carry values that tower over everything else, and they are the same few channels in every token.

Figure 1 · the outlier field
A layer's input activation X (tokens × input-channels) over its weight W. Three fixed channels glow in every token row and tower ~70× over the rest on the log-scaled bars below. Toggle to W: flat and uniform, no outliers. Hover any channel to read its max. The spikes sit in fixed channels, persistent across all tokens.

Three properties, all visible above, define these activation outliers. They are large, on the order of a hundred times the typical activation value. They live in a small number of channels, a fraction of a percent of the width. And they are persistent: if a channel is an outlier, it is an outlier in every token and every sequence, so its column stays bright all the way down. It is like a choir where the same few singers, in the same seats, drown out everyone else in every song.

This is not a quirk of one model. Dettmers and colleagues, in the LLM.int8() paper, traced these outliers appearing as a phase transition with scale: below a few billion parameters they are mild, and by around 6.7 billion they have taken over, showing up in essentially every layer at once. It is exactly the regime, models with tens or hundreds of billions of parameters, where you most want quantization and where naive quantization most completely fails. One number to hold onto: OPT-175B scores 66.9% average accuracy in FP16, and naive INT8 drops it to 35.5%, which for these zero-shot tasks is no better than random guessing.

A note on the "hundred times" figure, since two numbers float around and it is easy to think they disagree. SmoothQuant's ~100× is the outlier measured against the typical activation. LLM.int8() quotes a smaller ~3-20×, measured against the largest other channel. Both are right; they just pick different baselines, and either way the outlier dwarfs the crowd, which is all the next step needs.

How one outlier wastes the bits

Now connect the outlier to the ruler. Per-tensor quantization sets one step size from the global maximum, and the global maximum is the outlier. So the ruler stretches to reach a value a hundred times out, its ticks spread a hundred times apart, and the ordinary values, which all live near zero, are left sharing only the few ticks nearest the center.

The paper puts a number on the damage. For an ordinary channel with maximum magnitude $m_i$ inside a tensor whose global maximum is $m$, the number of the 256 levels that channel can actually reach is:

$$\text{effective levels}(i) = 2^{8}\cdot\frac{m_i}{m}$$

When the outlier makes $m$ about a hundred times an ordinary channel's $m_i$, that is roughly $256/100 \approx 2\text{ to }3$ levels. Note the word: levels, not bits. Bits are just the base-2 logarithm of the number of distinguishable levels, so two or three levels is a little over one bit of real precision ($\log_2 3 \approx 1.6$), out of a nominal eight. You paid for INT8 and, for most of the tensor, you are getting worse than INT2. Drag the outlier up in the figure below and watch an ordinary channel's values collapse onto two or three grid lines, rounding into each other until they are indistinguishable.
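The collapse can be counted directly. This sketch evaluates the effective-levels formula and, as a cross-check, counts how many distinct INT8 codes a spread of ordinary values in $[-1, 1]$ actually lands on as the tensor maximum grows; the function names and the test values are illustrative assumptions.

```python
import math

def effective_levels(channel_max, tensor_max, bits=8):
    """Levels of a per-tensor grid that fall inside one channel's range: 2**bits * m_i / m."""
    return 2 ** bits * channel_max / tensor_max

def distinct_codes(values, tensor_max):
    """Count the distinct INT8 codes a set of values lands on under a per-tensor step."""
    step = tensor_max / 127
    return len({round(v / step) for v in values})

ordinary = [i / 100 for i in range(-100, 101)]   # 201 values spread evenly over [-1, 1]

for m in (1.0, 10.0, 100.0):                     # tensor maximum set by the outlier
    levels = effective_levels(1.0, m)
    print(m, levels, math.log2(levels), distinct_codes(ordinary, m))
```

With the outlier at 100 the formula gives 2.56 levels and the 201 ordinary values land on just three codes.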

Figure 2 · effective levels
Per-tensor INT8 spreads 256 levels across $[-m, m]$, with $m$ pinned by the outlier. Above: the full range, ordinary values crammed near center. Below: a zoom on the normal range, showing the few levels it keeps and the values snapping onto them. Drag the outlier to ~100× and the count drops to the paper's 2-3 of 256.

The bits are all being spent on the empty gap between the crowd and the outlier. The obvious repair is to stop making everyone share one ruler.

The fix the GEMM blocks

If a few channels are the problem, give each channel its own ruler. That is per-channel quantization, and for the outlier channel it is exactly right: measure it against its own maximum, and its ordinary neighbors keep their precision. Per-channel quantization of the weights is standard and completely fine. The activations are where it breaks down, and not because the feature is missing: the obstacle lives in the shape of the matmul.

Recall that each output is a sum over the input-channel axis, $Y_{t,o} = \sum_j X_{t,j} W_{j,o}$. After an INT8 matmul you have to undo the quantization, multiplying by the step sizes to get back to real numbers. A factor that does not depend on the summation index $j$ is a constant as far as the sum is concerned, so it slides straight out in front. A per-token activation scale $\Delta_X[t]$ does not depend on $j$, so it factors out to the left. A per-output-channel weight scale $\Delta_W[o]$ does not depend on $j$ either, so it factors out to the right. That is exactly the dequantization the fast kernels support:

$$\mathbf{Y} = \operatorname{diag}(\boldsymbol{\Delta}_{\mathbf{X}})\,\big(\bar{\mathbf{X}}^{\text{INT8}}\,\bar{\mathbf{W}}^{\text{INT8}}\big)\,\operatorname{diag}(\boldsymbol{\Delta}_{\mathbf{W}}) \tag{2}$$

A per-input-channel activation scale $\Delta[j]$ will not factor, because $j$ is the summation index itself. It rides inside the sum, $\sum_j \Delta[j]\,X_{t,j} W_{j,o}$, with a different value on every term, and there is no way to pull it out past the accumulation. The integer matmul accumulates in one pass and applies its scales once at the end; a per-input-channel scale would demand a separate floating-point multiply on every term before adding, which is precisely what the integer GEMM was supposed to avoid. The barrier is the axis, not the granularity. Pick a scheme in the figure and watch where its scale lands relative to the sum.
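The factoring argument can be checked numerically. The sketch below quantizes $\mathbf{X}$ per token and $\mathbf{W}$ per output channel, accumulates in pure integers, applies both scales once at the end as in equation (2), and compares against dequantizing every term before summing. It is plain Python on toy matrices, not a model of any real GEMM kernel; all names are illustrative.

```python
def quantize_rows(M):
    """Per-row symmetric INT8: returns integer codes and one step size per row."""
    steps = [max(abs(v) for v in row) / 127 for row in M]
    codes = [[round(v / s) for v in row] for row, s in zip(M, steps)]
    return codes, steps

def transpose(M):
    return [list(col) for col in zip(*M)]

X = [[0.2, -0.4, 1.5], [0.1, 0.3, -0.9]]          # [T=2, C_i=3]
W = [[0.3, -0.1], [0.2, 0.4], [-0.1, 0.2]]        # [C_i=3, C_o=2]

Xq, dX = quantize_rows(X)                          # per-token scales dX[t]
WqT, dW = quantize_rows(transpose(W))              # per-output-channel scales dW[o]
Wq = transpose(WqT)
T, Ci, Co = len(X), len(W), len(W[0])

# Integer-only accumulation over j, scales applied once at the end (eq. 2).
acc = [[sum(Xq[t][j] * Wq[j][o] for j in range(Ci)) for o in range(Co)] for t in range(T)]
Y_fast = [[dX[t] * acc[t][o] * dW[o] for o in range(Co)] for t in range(T)]

# Reference: dequantize each term first, then sum in floating point.
Y_ref = [[sum((dX[t] * Xq[t][j]) * (Wq[j][o] * dW[o]) for j in range(Ci))
          for o in range(Co)] for t in range(T)]

print(Y_fast)
print(Y_ref)   # same values up to floating-point rounding
```

A scale indexed by `j` would have to sit inside the `sum(...)` in `acc`, turning the integer accumulation into a floating-point one.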

Figure 3 · which scales survive the sum
$\mathbf{Y} = \mathbf{X}\mathbf{W}$ contracts over the inner channel axis $C_i$. A scale factors out of the integer matmul only if it is constant along that sum: per-token (X rows) and per-output-channel (W cols) both do. A per-input-channel activation scale rides the summation index, so it is trapped inside the sum. That is why per-channel activation quantization is off the table.

Every prior method got stuck in this corner. You can have the precision of per-channel activation scaling, or you can have the speed of the integer matmul, but not both, because the scale that would tame the outlier rides the very axis being summed over. LLM.int8() took the precision and paid for it: it detects the outlier channels at runtime, peels them off into a separate FP16 matmul, and runs the rest in INT8, then adds the two back together. That keeps the accuracy (it is nearly lossless), but the outlier columns are scattered and irregular, the extra floating-point matmul plus the gather and scatter around it are expensive, and in SmoothQuant's own benchmarks the result usually runs slower than plain FP16, recovering the accuracy but losing the speedup it was meant to deliver.
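To make the mixed-precision idea concrete, here is a rough sketch of the decomposition described above: input channels whose activation magnitude exceeds a threshold go through a floating-point product, the rest through an INT8 round trip, and the two partial results are added. This is only an illustration of the idea, not the LLM.int8() implementation; the function name, the per-tensor quantizer, the toy data, and the threshold value are all assumptions.

```python
def mixed_precision_matmul(X, W, threshold):
    """Outlier input channels stay in floating point; the rest go through INT8."""
    T, Ci, Co = len(X), len(W), len(W[0])
    col_max = [max(abs(X[t][j]) for t in range(T)) for j in range(Ci)]
    outlier = [j for j in range(Ci) if col_max[j] > threshold]
    regular = [j for j in range(Ci) if col_max[j] <= threshold]
    # Steps come from the regular channels only, so the outlier no longer stretches the grid.
    x_step = (max((col_max[j] for j in regular), default=0.0) or 1.0) / 127
    w_step = (max((abs(W[j][o]) for j in regular for o in range(Co)), default=0.0) or 1.0) / 127
    q = lambda v, step: round(v / step) * step    # INT8 round trip for one value
    Y = [[0.0] * Co for _ in range(T)]
    for t in range(T):
        for o in range(Co):
            fp_part = sum(X[t][j] * W[j][o] for j in outlier)
            int8_part = sum(q(X[t][j], x_step) * q(W[j][o], w_step) for j in regular)
            Y[t][o] = fp_part + int8_part
    return Y

def max_abs_error(Y, Y_ref):
    return max(abs(a - b) for r1, r2 in zip(Y, Y_ref) for a, b in zip(r1, r2))

X = [[0.2, -0.4, 60.0], [0.1, 0.3, 55.0]]         # channel 2 is the outlier
W = [[0.3, -0.1], [0.2, 0.4], [-0.1, 0.2]]
Y_ref = [[sum(X[t][j] * W[j][o] for j in range(3)) for o in range(2)] for t in range(2)]

naive = mixed_precision_matmul(X, W, threshold=float("inf"))   # everything through INT8
mixed = mixed_precision_matmul(X, W, threshold=6.0)            # outlier channel kept in float

print(max_abs_error(naive, Y_ref), max_abs_error(mixed, Y_ref))
```

The accuracy gain is visible even at this scale; the cost, as the text notes, is the irregular channel split and the second matmul.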

SmoothQuant avoids the trade. It gets per-channel-quality quantization of the activations without ever needing a per-input-channel activation scale at runtime. The way through is to notice that the difficulty does not have to stay on the activations at all.

Migrate the difficulty

The activations are hard to quantize and the weights are easy, and they are multiplied together. So move some of the hardness across the multiply. Insert a per-input-channel factor $\mathbf{s}\in\mathbb{R}^{C_i}$, divide the activations by it, and multiply the weights by it. On the shared channel axis, dividing then multiplying by the same numbers cancels, so the layer computes the identical output:

$$\mathbf{Y} = \big(\mathbf{X}\,\operatorname{diag}(\mathbf{s})^{-1}\big)\big(\operatorname{diag}(\mathbf{s})\,\mathbf{W}\big) = \hat{\mathbf{X}}\hat{\mathbf{W}} = \mathbf{X}\mathbf{W} \tag{3}$$

The $\operatorname{diag}(\mathbf{s})^{-1}\operatorname{diag}(\mathbf{s})$ in the middle is the identity; it sits at the seam between $\mathbf{X}$ and $\mathbf{W}$ and does nothing to the product. It is the algebra of multiplying and dividing a fraction by the same number, $(a/s)(sb) = ab$, applied one channel at a time. In components, $\hat{X}_{t,j} = X_{t,j}/s_j$ shrinks the activation in channel $j$, and $\hat{W}_{j,o} = s_j\,W_{j,o}$ grows the corresponding weight row to compensate. Pick a large $s_j$ for an outlier channel and its activation spike shrinks while the flat weight there absorbs the difference.

One thing this transform is not is an approximation. Mathematically, $\hat{\mathbf{X}}\hat{\mathbf{W}}$ equals $\mathbf{X}\mathbf{W}$ exactly; in floating point the two agree up to ordinary rounding error, far below anything quantization introduces. Nothing is lost by rescaling. The only meaningful error anywhere in SmoothQuant is the INT8 rounding that comes afterward, and the entire point of the rescale is to make that rounding easy: instead of one tensor with a hundred-fold spike, you now have two tensors that are each mild. The term for what just happened is migrating the quantization difficulty. The activation's outlier problem does not get solved so much as shared, pushed partway onto a weight matrix that had precision to spare.
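A quick numerical check of equation (3), as a sketch on toy data. The smoothing factors in `s` are arbitrary hypothetical values (any positive vector works), and the comparison uses a small tolerance because each floating-point divide and multiply rounds.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[0.2, -0.4, 60.0], [0.1, 0.3, 55.0]]       # [T, C_i], channel 2 is the outlier
W = [[0.3, -0.1], [0.2, 0.4], [-0.1, 0.2]]      # [C_i, C_o]
s = [1.0, 1.5, 14.3]                            # hypothetical per-input-channel factors

X_hat = [[x / sj for x, sj in zip(row, s)] for row in X]   # divide column j of X by s_j
W_hat = [[sj * w for w in row] for row, sj in zip(W, s)]   # multiply row j of W by s_j

Y = matmul(X, W)
Y_hat = matmul(X_hat, W_hat)

print(Y)
print(Y_hat)                                               # same product
print(max(abs(v) for row in X for v in row),               # activation range before
      max(abs(v) for row in X_hat for v in row))           # and after smoothing
```

The product is unchanged while the activation range drops from 60 to about 4.2, which is the whole trade.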

Below is the migration itself, one bar per channel: activations on top, weights below. Raise the strength and the outlier channel's spike slides from the amber activation into the teal weight, the two meeting balanced in the middle.

Figure 4 · migrating the difficulty
Per-channel maxima of activation $\hat{\mathbf{X}}$ (top) and weight $\hat{\mathbf{W}}$ (bottom). Drag the migration strength $\alpha$. At $\alpha=0$ the activation keeps the whole spike; at $\alpha=1$ the weight does; at $\alpha=0.5$ both reach the same height and neither has a large outlier. The product $\hat{\mathbf{X}}\hat{\mathbf{W}} = \mathbf{X}\mathbf{W}$ is unchanged throughout.

Both extremes fail, for the same reason the middle works. At $\alpha=0$ every channel keeps its activation spike and you are back to the original broken problem; at $\alpha=1$ the spike lands entirely on the weights and their quantization collapses instead. Only a split in the middle tames both, and finding the right split is the last piece.

How much to migrate

How much difficulty to move is a single dial, the migration strength $\alpha$, and the paper sets the smoothing factor from the per-channel maxima of both tensors:

$$\mathbf{s}_j = \frac{\max(|\mathbf{X}_j|)^{\alpha}}{\max(|\mathbf{W}_j|)^{1-\alpha}}, \qquad j = 1,\dots,C_i \tag{4}$$

The exponents split a channel's difficulty on a log scale:

$$\log s_j = \alpha\log\max|\mathbf{X}_j| - (1-\alpha)\log\max|\mathbf{W}_j|$$

To see what the balance does, push the formula through. After smoothing, the two channel maxima come out as:

$$\max|\hat{X}_j| = \big(\max|\mathbf{X}_j|\,\max|\mathbf{W}_j|\big)^{1-\alpha}, \qquad \max|\hat{W}_j| = \big(\max|\mathbf{X}_j|\,\max|\mathbf{W}_j|\big)^{\alpha}$$

Those two are equal exactly when $1-\alpha = \alpha$, so only at $\alpha = 0.5$ do the smoothed activation and weight share the same per-channel maximum: both land on $\sqrt{\max|\mathbf{X}_j|\,\max|\mathbf{W}_j|}$, the geometric mean of the original maxima. (A caution, because it is a natural slip: the geometric mean is the value of the smoothed maxima, not of $s_j$ itself. The factor $s_j$ is a ratio of maxima, $\sqrt{\max|\mathbf{X}_j|/\max|\mathbf{W}_j|}$ at $\alpha=0.5$.)
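For readers who want the substitution spelled out: $s_j$ is a positive constant within channel $j$, so it passes straight through the maximum, and plugging in equation (4) gives the two smoothed maxima.

```latex
\begin{aligned}
\max|\hat{X}_j| &= \frac{\max|\mathbf{X}_j|}{s_j}
  = \max|\mathbf{X}_j| \cdot \frac{\max|\mathbf{W}_j|^{1-\alpha}}{\max|\mathbf{X}_j|^{\alpha}}
  = \big(\max|\mathbf{X}_j|\,\max|\mathbf{W}_j|\big)^{1-\alpha}, \\[4pt]
\max|\hat{W}_j| &= s_j \cdot \max|\mathbf{W}_j|
  = \frac{\max|\mathbf{X}_j|^{\alpha}}{\max|\mathbf{W}_j|^{1-\alpha}} \cdot \max|\mathbf{W}_j|
  = \big(\max|\mathbf{X}_j|\,\max|\mathbf{W}_j|\big)^{\alpha}, \\[4pt]
\max|\hat{X}_j| \cdot \max|\hat{W}_j| &= \max|\mathbf{X}_j| \cdot \max|\mathbf{W}_j|
  \quad \text{for every } \alpha .
\end{aligned}
```

The last line is the conservation statement behind the dial: the product of the two maxima is fixed, and $\alpha$ only decides how it is divided between them.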

Make it concrete on the outlier channel. Say it has activation maximum 70 and weight maximum 0.34. At $\alpha=0.5$, $s_j = \sqrt{70/0.34} \approx 14.3$. The activation drops from 70 to $70/14.3 \approx 4.9$, and the weight rises from 0.34 to $14.3\times 0.34 \approx 4.9$. Both at 4.9. The per-tensor step now spans only to 4.9 instead of 70, so plugging the new ceiling back into the effective-levels count $2^8\cdot m_i/m$ shows an ordinary channel with maximum near 1 now getting about $256\cdot(1/4.9) \approx 50$ levels, up from the three or four it had ($256/70 \approx 3.7$). Cutting the range from 70 to 4.9 gives the ordinary channels back the precision they had lost, and with it the model's accuracy.
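The same arithmetic in code, as a sketch. `smoothing_factor` is an illustrative single-channel version of equation (4), and the "after" level count assumes, as the paragraph above does, that the ordinary channel's own maximum stays near 1.

```python
import math

def smoothing_factor(act_max, wgt_max, alpha=0.5):
    """Eq. (4) for a single input channel."""
    return act_max ** alpha / wgt_max ** (1 - alpha)

act_max, wgt_max = 70.0, 0.34            # the outlier channel from the text
s = smoothing_factor(act_max, wgt_max)
act_hat = act_max / s                    # smoothed activation maximum
wgt_hat = wgt_max * s                    # smoothed weight maximum

levels_before = 256 * 1.0 / act_max      # ordinary channel (max 1) against a ceiling of 70
levels_after = 256 * 1.0 / act_hat       # same channel against the smoothed ceiling

print(round(s, 2), round(act_hat, 2), round(wgt_hat, 2))
print(round(levels_before, 1), round(levels_after, 1))
print(math.isclose(act_hat, math.sqrt(act_max * wgt_max)))   # geometric mean of the originals
```

Changing `alpha` in the call shows the two smoothed maxima pulling apart in opposite directions.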

In practice $\alpha=0.5$ is the setting for the OPT and BLOOM families, with a broad flat sweet spot around it (roughly 0.4 to 0.6) so the exact value is not delicate. GLM-130B is the interesting exception: with something like 30% of its channels acting as outliers, an even split still leaves the activations too spiky to quantize, so it pushes $\alpha=0.75$ and shifts more of the load onto the weights. The dial is not one-size-fits-all, but it is one number found by a quick search on a few calibration batches, not a training run.
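A search over the dial can be sketched in a few lines: for each candidate value, smooth, quantize both operands, and measure the output error against the unquantized product. Everything here is an assumption made for illustration (synthetic Gaussian data with one channel scaled by 70, per-tensor quantization, mean squared error as the criterion, an 11-point grid); it is not the paper's calibration procedure.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fake_quant(M):
    """Per-tensor symmetric INT8 round trip (quantize, then dequantize)."""
    step = max(abs(v) for row in M for v in row) / 127
    return [[round(v / step) * step for v in row] for row in M]

def smooth(X, W, alpha):
    """Apply eq. (4) per input channel and return the rescaled pair."""
    Ci = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(Ci)]
    wgt_max = [max(abs(w) for w in W[j]) for j in range(Ci)]
    s = [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, wgt_max)]
    X_hat = [[x / sj for x, sj in zip(row, s)] for row in X]
    W_hat = [[sj * w for w in row] for row, sj in zip(W, s)]
    return X_hat, W_hat

def w8a8_error(X, W, Y_ref):
    """Mean squared output error after quantizing both operands."""
    Y = matmul(fake_quant(X), fake_quant(W))
    n = len(Y) * len(Y[0])
    return sum((a - b) ** 2 for r1, r2 in zip(Y, Y_ref) for a, b in zip(r1, r2)) / n

rng = random.Random(0)                           # fixed seed so the run is repeatable
T, Ci, Co = 16, 32, 8
X = [[rng.gauss(0, 1) for _ in range(Ci)] for _ in range(T)]
for row in X:
    row[3] *= 70.0                               # one persistent outlier channel
W = [[rng.gauss(0, 0.1) for _ in range(Co)] for _ in range(Ci)]
Y_ref = matmul(X, W)

baseline = w8a8_error(X, W, Y_ref)               # no smoothing at all
errors = {a / 10: w8a8_error(*smooth(X, W, a / 10), Y_ref) for a in range(11)}
best_alpha = min(errors, key=errors.get)

print("unsmoothed:", baseline)
for alpha, err in errors.items():
    print(alpha, err)
print("best alpha on this toy data:", best_alpha)
```

On this toy setup the error should dip in the middle of the range and rise toward both ends, which is the qualitative shape the text describes.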

Free at runtime

One worry remains. If the activations have to be divided by $\mathbf{s}$ before every matmul, that is an extra per-channel scaling on the activations at runtime, which sounds like exactly the cost the whole scheme was avoiding. It is not, because the division never happens as its own step. The activation entering a linear layer is produced by the layer just before it, usually a LayerNorm, which already multiplies every output channel by a learned gain and adds a bias, one multiply-add per channel. Folding $1/\mathbf{s}$ into that gain and bias changes the numbers they hold, not the number of operations that run, so the smoothed activations arrive pre-divided at no added cost. The weight side is even simpler: multiply the weights by $\mathbf{s}$ once, offline, and store $\hat{\mathbf{W}}$. At serving time there is no trace of $\mathbf{s}$ left; the model runs exactly as before.
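The folding step can be verified on a single token vector. This sketch implements a textbook LayerNorm by hand and checks that dividing its output by $\mathbf{s}$ gives the same result as dividing its gain and bias by $\mathbf{s}$ beforehand; the numbers, including the smoothing factors, are hypothetical.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """LayerNorm over one token vector: normalize, then apply per-channel gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * g + b for v, g, b in zip(x, gain, bias)]

x = [0.5, -1.2, 3.0, 0.1]
gain = [1.0, 0.8, 20.0, 1.1]      # channel 2 plays the outlier
bias = [0.0, 0.1, 5.0, -0.2]
s = [1.0, 1.0, 14.3, 1.0]         # hypothetical smoothing factors

# Explicit version: run LayerNorm, then divide each channel by s (extra runtime work).
explicit = [v / sj for v, sj in zip(layer_norm(x, gain, bias), s)]

# Folded version: rewrite the parameters once offline; the runtime op count is unchanged.
folded_gain = [g / sj for g, sj in zip(gain, s)]
folded_bias = [b / sj for b, sj in zip(bias, s)]
folded = layer_norm(x, folded_gain, folded_bias)

print(explicit)
print(folded)    # same values up to floating-point rounding
```

The identity holds because the gain and bias are both applied per channel after normalization, so a per-channel divisor distributes over them.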

Together, "training-free" and "offline" reduce the method to one calibration pass to measure the activation maxima, one formula for $\mathbf{s}$, and one rewrite of the surrounding layers:

# training-free: one calibration pass, then an offline rewrite.
# calibration_batches: iterable of [tokens, C_i] inputs to the linear layer (no labels needed)
# W: [C_i, C_o] weights,  ln: the preceding torch.nn.LayerNorm,  alpha in [0, 1]
import torch

@torch.no_grad()
def smooth_layer(ln, W, calibration_batches, alpha=0.5):
    act_max = torch.zeros(W.shape[0], dtype=W.dtype, device=W.device)  # per-input-channel |activation| max
    for x in calibration_batches:              # ~512 sentences
        act_max = torch.maximum(act_max, x.abs().amax(dim=0))
    wgt_max = W.abs().amax(dim=1)              # per-input-channel |weight| max

    s = act_max.pow(alpha) / wgt_max.pow(1 - alpha)   # eq (4), a vector of length C_i

    W_hat = s[:, None] * W                     # same as diag(s) @ W: weights absorb the spread
    ln.weight.div_(s)                          # fold 1/s into the preceding LayerNorm
    ln.bias.div_(s)                            # so activations arrive pre-smoothed, free
    return W_hat                               # W_hat and the smoothed activations now both quantize cleanly to INT8

The one case with no preceding scale to fold into is a residual add, where the activation arrives from a shortcut connection that bypasses the layer. There SmoothQuant keeps an explicit scale on the residual branch, following the Gamma Migration trick from Wei and colleagues' Outlier Suppression. It is a small exception and does not change the picture: the transformation is fixed once, offline, and disappears into the weights.

What it buys

Line the methods up on OPT-175B and the split is stark. The three that quantize activations naively (plain W8A8, ZeroQuant, Outlier Suppression) all collapse to about 35% accuracy and perplexity in the tens of thousands, which is a model producing noise. LLM.int8() holds accuracy at 66.7% but, as covered above, gives back the speed. SmoothQuant, at all three of its aggressiveness levels, lands at 66.4 to 66.8%, against FP16's 66.9%, while staying in pure INT8. Toggle to perplexity and the collapse is even clearer: the broken baselines blow past 80,000 while SmoothQuant sits at 11.1, next to FP16's 10.99.

Figure 5 · accuracy recovered
OPT-175B under INT8, from the paper's Table 3. Naive baselines collapse to ~random; LLM.int8() keeps accuracy but is slower in this harness; SmoothQuant recovers near-FP16 accuracy in pure INT8. Toggle to perplexity (log) for the blow-up; click any bar for its numbers.

The efficiency is the other half. In the PyTorch implementation SmoothQuant runs OPT models about 1.51× faster while using 1.96× less memory on a single A100, and the tuned FasterTransformer integration (NVIDIA's production inference library) reaches up to 1.56× with memory roughly halved. Those realized numbers sit just under the 2× ceiling that INT8 promises on paper, as you would expect once real kernels, overheads, and the parts that stay in FP16 are counted. The memory saving lands hardest: OPT-175B fits on four GPUs instead of eight, OPT-66B on one instead of two, and the 530-billion-parameter MT-NLG model, which the paper also quantizes with no accuracy loss, serves inside a single 8-GPU node, halving the machine you need to run the largest models made public. Tensor parallelism reaches single-node serving by splitting each layer's matrices across the GPUs of one machine; SmoothQuant reaches the same place from the other direction, by shrinking the numbers instead of splitting the layers.

The reach is broad because the method assumes almost nothing. Any matmul with an activation-times-weight structure and outliers concentrated in a few input channels is a candidate, which is why the same recipe covers OPT, BLOOM, GLM, MT-NLG, and the Llama, Mistral, and Mixtral families. There is no architecture it was designed for and no training it needs. You measure a few maxima, choose one number, and rewrite two layers, and a model that could not survive 8 bits now runs on half the GPUs at nearly twice the matmul throughput, with its accuracy intact. The outliers were never the real problem; forcing one tensor to absorb all of them was.

Provenance: verified against primary literature
Integer quantization: Jacob et al. 2018 (affine INT8); symmetric restricted range per the Gholami survey / NVIDIA Wu et al. 2020.
LLM.int8() (Dettmers 2022): Emergent activation outliers past ~6.7B; the mixed-precision FP16 baseline SmoothQuant replaces.
Outlier Suppression (Wei 2022): Per-channel scale migration and the residual-branch (Gamma Migration) trick.
CUTLASS / FasterTransformer: INT8 GEMM kernels and the production serving integration.
Correction: SmoothQuant's rescale is mathematically identity-preserving (the diagonal factors cancel; in floating point the match holds up to ordinary rounding error), so smoothing loses nothing on its own; the only meaningful error is the later, now-easy INT8 rounding. And the 'infeasible per-channel' case is specifically per-INPUT-channel ACTIVATION scaling, which rides the GEMM's contraction axis. Per-channel WEIGHT scaling is standard and fine.

Questions you might still have

Why not just quantize the weights and leave activations in FP16?
Weight-only INT8 (W8A16) dequantizes the weights back to FP16 before the matmul, so the multiply still runs on FP16 units at the FP16 rate. You save memory and bandwidth, which helps single-token decoding, but you get no arithmetic speedup. Only W8A8, both operands in INT8, runs the integer matmul and gets the ~2x throughput. That is why the activations have to be quantized too.

If the rescale is exactly equivalent, why does accuracy drop at all?
The rescale itself loses nothing: mathematically the diagonal factors cancel and the layer output is unchanged (in floating point, up to ordinary rounding error). Essentially all of the error comes from the INT8 rounding that happens afterward. Smoothing does not remove that rounding; it makes it easy, by handing the quantizer two mild tensors instead of one with a hundred-fold spike.

Does SmoothQuant get rid of the outliers?
No, it relocates them. Part of the outlier's magnitude is divided out of the activation and multiplied into the corresponding weight row. It does not vanish; it stops being concentrated in the one tensor that could not absorb it. At the balance point the spike is spread so that neither tensor has a maximum far above its neighbors.

How is this different from LLM.int8(), or from GPTQ and AWQ?
LLM.int8() keeps the outlier channels in FP16 and the rest in INT8, a mixed-precision split that is accurate but whose scattered FP16 matmul is slow. SmoothQuant keeps everything in INT8. GPTQ and AWQ are weight-only methods (W-only, no activation quantization), so they help memory but not the matmul. For the library’s other take on compressing numbers, see the TurboQuant explainer, which tackles the different problem of squeezing individual vectors to a few bits.

Where does alpha = 0.5 come from, and when do you change it?
0.5 is the balance point where the smoothed activation and weight share the same per-channel maximum, the geometric mean of their originals. It works for OPT and BLOOM with a wide sweet spot around it. GLM-130B has harder activations, about 30% of channels acting as outliers, so it migrates more onto the weights with alpha = 0.75. The value is found by a quick grid search on calibration data, not by training.

Footnotes & further reading

  1. The paper: Xiao, Lin, Seznec, Wu, Demouth, Han, SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (MIT / NVIDIA, ICML 2023). Code.
  2. Emergent activation outliers and the mixed-precision baseline: Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022).
  3. The per-channel scale migration and the residual-branch trick SmoothQuant reuses: Wei et al., Outlier Suppression (2022).
  4. Integer-arithmetic quantization: Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2018); the symmetric conventions are surveyed in Gholami et al., A Survey of Quantization Methods (2021).
  5. For another angle on compressing model numbers, the library's TurboQuant explainer covers near-optimal low-bit compression of individual vectors, a different problem from SmoothQuant's per-channel rebalancing.