Distilling the Knowledge in a Neural Network

A big model ranks the wrong answers too, and that ranking teaches a small one to generalize.

Hinton, Vinyals, and Dean (2015) call it distillation: soften the big model's softmax with a temperature so those rankings show, then train the small model to match them. Most of the big model's ability to generalize comes with it, often from a fraction of the data.

Explaining the paper: Distilling the Knowledge in a Neural Network · Hinton, Vinyals, Dean · Google · NIPS 2014 Deep Learning Workshop · arXiv:1503.02531

A trained classifier's answer is a single label. Almost everything it learned is hiding in the answers it ranked just below.

One problem shapes how machine learning gets deployed. The most reliable way to push accuracy up a notch is to train many models and average their predictions. The average cancels the idiosyncratic mistakes of any single model and keeps what they agree on, so an ensemble almost always beats its members. The trouble shows up at deployment. Running ten models for every query costs ten times the compute and ten times the memory, and a service answering billions of requests on phones cannot pay that. The size and multiplicity that make the model good also make it impossible to ship.

So you want the accuracy of the cumbersome model and the cost of a small one. Rich Caruana and his collaborators showed this is not a fantasy: the knowledge in a big ensemble can be compressed into a single small network that is far easier to deploy.[2] This paper takes that idea and gives it a clean, general mechanism, which the authors call distillation.

One habit of thought gets in the way first. Clear it and the rest of the paper follows. We tend to identify a model's knowledge with its weights, which makes it hard to imagine moving that knowledge into a network with completely different weights. A more useful view is that the knowledge is a learned mapping from inputs to outputs, and the output worth copying is not the single class the model picks. It is the entire distribution over classes. When a good model sees a BMW it puts almost all its probability on "BMW," but the scraps it leaves on the wrong classes are not random: a tiny chance of "garbage truck," a far tinier chance of "carrot." That ordering says a BMW looks more like a truck than like a vegetable, and a model that learned to see the world that way will generalize better than one that only ever knew the right label. A hard one-hot label throws all of that away.

A few ideas carry the paper: what a trained model actually knows, a temperature knob that exposes it, the gradient that transfers it, and why distillation subsumes the older compression method as a special case. The experiments then push it from MNIST to a production speech recognizer to a 15,000-class image problem.

The cumbersome model and the deployment gap

Ensembles win for a concrete reason, and that reason is also why distillation can recover their gains. Each model in an ensemble makes errors that come partly from where it happened to land in training: its random initialization, the order it saw the data. Average many such models and those private errors partly cancel, while the signal they all picked up reinforces. The paper's speech ensemble is ten copies of the same architecture trained from different random seeds, and that alone makes the average beat any single net. Dropout, the regularizer the MNIST model leans on, is the same idea folded into one network: it trains a huge family of weight-sharing subnetworks at once, then at test time runs the full network once with its weights scaled down, a single pass that approximates averaging them all.[5]

The cumbersome model, then, is either a literal ensemble or a single very large, heavily regularized network. Either way it is wonderful for extracting structure from data and miserable to deploy. The analogy the paper opens with is an insect with a larval form built for eating and an adult form built for travel: there is no rule that the thing you train and the thing you ship have to be the same shape. Training can be slow, parallel, and enormous; deployment has to be fast and small. Distillation is the metamorphosis between them.

The goal is now sharp: take a trained cumbersome model and transfer its knowledge into a small "distilled" model that runs cheaply, while keeping as much of the cumbersome model's accuracy, and especially its way of generalizing, as possible.

Knowledge is the whole distribution

Start with what the small model should be trained on. The obvious answer, the true labels, is exactly what throws away the knowledge we just identified. A one-hot label says "this is a 2" and assigns zero to every other digit, so it cannot tell the student that this particular 2 looks a bit like a 3. The richer target is the cumbersome model's own output distribution, used as a soft target: train the small model to reproduce the big model's full vector of class probabilities.

For a confident model the informative part of that vector is microscopic, which is the obstacle the rest of the section clears. On MNIST the cumbersome model is right with overwhelming confidence, so one version of a 2 might get a probability of $10^{-6}$ of being a 3 and $10^{-9}$ of being a 7, while another version of a 2 has those two flipped. That ratio (this 2 leans 3, that 2 leans 7) is precisely the similarity structure worth transferring, and at face value it has almost no effect on the training loss because the numbers are so close to zero. A target of $10^{-6}$ contributes almost nothing to the cross-entropy loss.

One knob on the softmax brings those ratios back into reach. A network turns its logits (the raw, unbounded scores $z_i$ it computes for each class) into probabilities by exponentiating and normalizing. Insert a temperature $T$ (the name is from physics, where heating a system spreads its probability over more states) that divides every logit before the exponential:

$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \tag{1}$$

At $T=1$ this is the ordinary softmax. Raising $T$ shrinks the gaps between logits before they are exponentiated, so the probabilities spread out: the distribution gets softer and its entropy rises, and in the limit $T\to\infty$ every class approaches the same $1/N$. Lowering $T$ toward zero does the opposite, sharpening onto the single top class until the output is one-hot. (It is easy to misremember the direction, so pin it down: raising $T$ makes the distribution softer and flatter, lowering it makes the distribution sharper.) Softening is exactly what surfaces those buried ratios. Raise the temperature and the $10^{-6}$ and $10^{-9}$ grow into numbers a student can actually be pulled toward.
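To see the knob at work, here is a minimal sketch of Eq (1) in plain Python. The logits are made up for illustration (they are not the paper's, nor the ones behind the figure), and `softmax_t` and `entropy` are hypothetical helper names.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher logits for three classes: "2", "3", "7".
logits = [10.0, 4.0, 2.0]
for T in (1.0, 4.0, 20.0):
    q = softmax_t(logits, T)
    print(T, [round(x, 4) for x in q], round(entropy(q), 4))
```

Running it should show the top class giving up mass and the entropy rising as T goes up, which is the direction described above.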

The figure makes the knob concrete. A teacher looks at one image of a 2 and produces fixed logits; drag the temperature and watch the distribution melt. At $T=1$ the 2 owns about 96% and the rest are invisible. Raise $T$ to 4 and the same logits read roughly 40% on the 2, 17% on the 3, 13% on the 7, with the genuinely dissimilar digits (the 0, the 4) still pinned near the floor. The teal bars, the 3 and the 7, climb clear of the floor while the dissimilar digits stay pinned:

Figure 1 · soft targets. A teacher's softmax over the ten digits for one image of a 2. At T=1 the answer takes almost all the probability and everything else is near zero. Raise T and the second guesses (3 and 7) climb well clear of the dissimilar digits, while the 0 and the 4 stay pinned at the floor.

This structure is not the same thing as label smoothing, and the difference matters. Label smoothing also replaces a one-hot target with something softer, but it spreads a fixed, equal sliver of probability over every wrong class regardless of the input. Soft targets are the opposite: non-uniform and tied to the specific image, so the classes that light up are the ones this input genuinely resembles. Temperature alone only softens; the structure comes from a teacher that actually learned the data. The popular name for the signal carried in those wrong-class probabilities is "dark knowledge," though that phrase is from a later Hinton talk, not this paper, which calls it a "rich similarity structure."[4]
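The contrast can be made concrete with a small sketch. Everything here is illustrative: the logits are invented, `label_smoothing` is a hypothetical helper that implements the variant described above (the smoothing mass `eps` is split equally over the wrong classes only), and the temperature of 4 is an arbitrary choice.

```python
import math

def softmax_t(logits, T=1.0):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_smoothing(true_class, num_classes, eps=0.1):
    """One-hot target with eps spread uniformly over the wrong classes."""
    wrong = eps / (num_classes - 1)
    return [1.0 - eps if i == true_class else wrong for i in range(num_classes)]

# Hypothetical teacher logits for two different images of the same class (index 0).
image_a = [9.0, 5.0, 1.0, -3.0]   # this one leans toward class 1
image_b = [9.0, 1.0, 5.0, -3.0]   # this one leans toward class 2

smooth_a = label_smoothing(0, 4)
smooth_b = label_smoothing(0, 4)
soft_a = softmax_t(image_a, T=4.0)
soft_b = softmax_t(image_b, T=4.0)

print("smoothed targets identical:", smooth_a == smooth_b)
print("soft target A:", [round(x, 3) for x in soft_a])
print("soft target B:", [round(x, 3) for x in soft_b])
```

The smoothed targets cannot tell the two images apart; the soft targets rank the wrong classes differently for each.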

Why this buys so much: a hard label is one symbol out of ten, a few bits per training case. A soft target is a full distribution, so each case transmits the entire geometry of how the classes relate, and the paper notes a second benefit, "much less variance in the gradient between training cases." Two nearly identical 2s get nearly identical soft targets, so the gradients they produce point the same way instead of both slamming into the rigid corner a one-hot label demands. More information per case and a steadier signal across cases is why, as we will see, a distilled model can learn from far less data and at a higher learning rate.

The distillation recipe

The training loop follows from the reframing. Run the frozen teacher to get its logits $v_i$, soften them at temperature $T$ into soft targets $p_i$, and train the student so its own softened probabilities $q_i$ match. "Match a fixed target distribution" is cross-entropy, and because the teacher's $p$ is fixed, minimizing that cross-entropy is the same as minimizing the KL divergence from teacher to student up to a constant (the teacher's own entropy, which the student cannot change):

$$-\sum_i p_i \log q_i = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

One detail in that sentence is easy to skip: the student's softmax also runs at the same high $T$ during training. You are matching the teacher's softened distribution with the student's softened distribution, both warmed by the same temperature. Only after training does the student revert to $T=1$ for actual predictions. Forgetting that the student trains hot is the most common way distillation gets implemented wrong.
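The cross-entropy identity is easy to confirm numerically. The sketch below uses invented teacher and student logits, both softened at the same temperature, and hypothetical helper names; it only checks that cross-entropy equals the teacher's entropy plus the KL divergence.

```python
import math

def softmax_t(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

T = 4.0
p = softmax_t([6.0, 2.0, 1.0, -2.0], T)   # hypothetical teacher logits
q = softmax_t([3.0, 3.0, 0.0, -1.0], T)   # hypothetical student logits, the SAME T

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
print(lhs, rhs)   # the two agree up to floating-point error
```

Because `entropy(p)` does not depend on the student, the student's gradient is the same whichever of the two losses you write down.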

When the true labels are also available, you can do a little better by adding them as a second objective: a cross-entropy with the correct hard labels, computed at $T=1$. The paper found the best results come from putting a considerably lower weight on this hard term than on the soft one. The student cannot exactly match the soft targets anyway, and when it has to err, erring toward the correct answer is the helpful direction to lean.

Combining the two terms has one trap. The gradient from the soft-target term shrinks as $1/T^2$ when you raise the temperature (the next section shows why). If you simply add the soft and hard losses, then every time you bump $T$ up to experiment, the soft contribution fades and the hard label takes over, so your two knobs interfere. The remedy is to multiply the soft-target loss by $T^2$, which cancels that shrinkage and keeps the two contributions on roughly the same scale as $T$ changes. The recipe is a few lines:

# distillation loss for one batch, written for PyTorch
# (teacher, student, x, y, T, and lam are assumed to be defined by the caller)
import torch
import torch.nn.functional as F

with torch.no_grad():                          # the teacher is frozen
    v = teacher(x)                             # teacher logits  [B, N]
z = student(x)                                 # student logits  [B, N]
p = F.softmax(v / T, dim=-1)                   # soft targets, temperature T
log_q = F.log_softmax(z / T, dim=-1)           # student log-probs, the SAME T
soft = -(p * log_q).sum(dim=-1).mean() * T**2  # soft cross-entropy; T^2 restores the scale
hard = F.cross_entropy(z, y)                   # correct labels y [B], at T = 1
loss = soft + lam * hard                       # lam small: mostly soft targets
loss.backward()                                # gradient flows into the student only
# at test time the student is run at T = 1

Nothing else about training changes, and everything downstream (the MNIST transfer, the speech ensemble, the specialists) runs this same loss at a different scale. What it buys is the deployment win we started with: after training, you throw the teacher away and ship a single small network that runs at $T=1$ like any other classifier.

Matching logits is the high-temperature limit

The one piece of real derivation in the paper answers a question you might already be asking: how does this relate to the older method of matching the cumbersome model's raw logits with a squared error? The answer ties the new method to the old one cleanly, and every step is something you already know.

Start with the gradient of the soft-target cross-entropy on a student logit. Softmax with cross-entropy has a famously clean derivative: the log in the cross-entropy undoes the exponential the softmax put in, the messy softmax Jacobian cancels, and what survives is prediction minus target. Here the softmax's input is the scaled logit $z_i/T$, so the clean result is with respect to that scaled input, and one chain-rule factor $1/T$ converts it back to the real logit:

$$\frac{\partial C}{\partial z_i} = \frac{1}{T}\,(q_i - p_i) = \frac{1}{T}\left(\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} - \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}\right) \tag{2}$$

Read it plainly: each student logit is pushed down where the student puts more mass than the teacher ($q_i > p_i$) and up where it puts less. The single $1/T$ out front comes from the chain rule through the division by $T$; it is not yet the $1/T^2$ scaling from the last section. That second factor appears only when we take $T$ large.
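If you want to convince yourself of Eq (2) without redoing the calculus, a finite-difference check does it. This is a sketch with invented logits and hypothetical function names: it compares the analytic gradient, prediction minus target divided by T, against a central-difference estimate of the soft-target loss.

```python
import math

def softmax_t(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss(z, v, T):
    """Soft-target cross-entropy between teacher logits v and student logits z."""
    p = softmax_t(v, T)
    q = softmax_t(z, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def analytic_grad(z, v, T):
    """Eq (2): dC/dz_i = (q_i - p_i) / T."""
    p = softmax_t(v, T)
    q = softmax_t(z, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def numeric_grad(z, v, T, h=1e-6):
    """Central differences on each student logit."""
    grads = []
    for i in range(len(z)):
        up = list(z)
        up[i] += h
        down = list(z)
        down[i] -= h
        grads.append((soft_loss(up, v, T) - soft_loss(down, v, T)) / (2 * h))
    return grads

z = [1.0, -0.5, 2.0, 0.3]    # hypothetical student logits
v = [4.0, 1.0, -2.0, -6.0]   # hypothetical teacher logits
T = 3.0
ga = analytic_grad(z, v, T)
gn = numeric_grad(z, v, T)
print(max(abs(a - b) for a, b in zip(ga, gn)))
```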

Now take it large. For high $T$ the exponent $z_i/T$ is small, so $\exp(z_i/T) \approx 1 + z_i/T$. Apply that to every term, top and bottom (using it on only one is a common slip), and the gradient straightens out:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3}$$

One more assumption collapses it. Suppose the logits have been zero-meaned separately for each transfer case, so $\sum_j z_j = \sum_j v_j = 0$. This is free to assume, because softmax is unchanged if you add a constant to all logits, so shifting each case to mean zero changes nothing about the model. With both sums gone, each denominator equals $N$ and the constant 1 in each numerator cancels between the two terms, leaving:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2}\,(z_i - v_i) \tag{4}$$
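Written out, the step from Eq (3) to Eq (4) under the zero-mean assumption is short. This is just the algebra described above made explicit, not an extra result:

```latex
\frac{\partial C}{\partial z_i}
  \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N} - \frac{1 + v_i/T}{N}\right)
  = \frac{1}{T}\cdot\frac{(z_i - v_i)/T}{N}
  = \frac{1}{N T^2}\,(z_i - v_i)
```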

Two things fall out at once. First, the $1/T^2$ from the last section is now visible: the high-temperature gradient really does shrink quadratically in $T$, which is exactly the shrinkage the $T^2$ multiplier undoes. Second, $(z_i - v_i)$ is the gradient of $\tfrac{1}{2}(z_i - v_i)^2$, and the $1/(N T^2)$ in front is a positive constant that does not move where the minimum sits. So in the high-temperature limit, distillation is doing least-squares matching of the student's logits to the teacher's. The general soft-target method contains the older logit-matching method as its $T\to\infty$ special case.
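Both claims can be checked with a few lines of arithmetic. The sketch below, on invented zero-meaned logits and with hypothetical helper names, compares the exact gradient of Eq (2) with the limiting form of Eq (4) as the temperature grows, printing their cosine similarity and the gradient's size.

```python
import math

def softmax_t(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def center(x):
    """Shift logits to zero mean; softmax is unchanged by this."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]

def exact_grad(z, v, T):
    """Eq (2): (q_i - p_i) / T."""
    p = softmax_t(v, T)
    q = softmax_t(z, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def limit_grad(z, v, T):
    """Eq (4): (z_i - v_i) / (N T^2), valid for zero-mean logits and large T."""
    n = len(z)
    return [(zi - vi) / (n * T * T) for zi, vi in zip(z, v)]

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

z = center([1.0, -0.5, 2.0, 0.3])    # hypothetical student logits
v = center([4.0, 1.0, -2.0, -6.0])   # hypothetical teacher logits
for T in (1.0, 5.0, 50.0, 500.0):
    g = exact_grad(z, v, T)
    print(T, round(cosine(g, limit_grad(z, v, T)), 6), norm(g))
```

The cosine should approach 1 as T rises, and doubling a large T should cut the gradient norm to roughly a quarter.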

That places a small but real correction on the historical record. The 2006 Model Compression paper usually credited for logit matching did not actually regress logits: it labeled a large transfer set with the ensemble's predictions and trained a small net on those. The explicit recipe of regressing the pre-softmax logits with a squared error is the later Ba and Caruana (2014) result.[3] Distillation's Eq (4) subsumes that whole line of work as one limit of a single method.

The interesting regime is the one in between, and Eq (4) shows why a moderate temperature can beat a very high one. At high $T$ the student is forced to match every logit equally, including the teacher's very negative ones, the "definitely not this class" scores. A class the teacher is already sure is wrong contributes almost no gradient during the teacher's own training, so whether that logit ended at -8 or -12 barely moved the loss and it drifted wherever it happened to land. That makes the very-negative scores noise rather than signal. At a lower $T$ the soft target on those classes is almost zero, so distillation pays them little attention and the student is free to ignore the noise. For a student too small to absorb everything, ignoring it is the right trade, which is why the paper finds the best temperature drops as the student shrinks: above 8 for a roomy 300-unit student, but down around 2.5 to 4 once the student is squeezed to 30 units per layer.
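A small numerical sketch shows how much the noisy logit matters at each temperature. The two teachers below are invented: they agree on every class except the one they are both sure is wrong, where one happened to land at -8 and the other at -12. The helper name and the temperatures tried are illustrative choices.

```python
import math

def softmax_t(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teachers differing only in the "definitely not this class" logit.
teacher_a = [6.0, 2.0, 0.0, -8.0]
teacher_b = [6.0, 2.0, 0.0, -12.0]

for T in (1.0, 2.5, 20.0):
    pa = softmax_t(teacher_a, T)
    pb = softmax_t(teacher_b, T)
    print(f"T={T}: soft target on the last class {pa[-1]:.5f} vs {pb[-1]:.5f}")
```

At the moderate temperature both teachers hand the student almost the same near-zero target on that class, so the arbitrary difference is ignored; at the high temperature the targets are large and visibly different, so the student is asked to fit the noise.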

The figure lets you check this by eye. Each row is a class with the teacher's logit (amber) and the student's (teal); the teal arrow is the gradient pull on the student logit. Raise the temperature and the pulls line up into pure logit matching (the printed cosine climbs to 1.00); drop it and the very negative logit, the "4," loses its pull entirely while the others keep theirs:

Figure 2 · the high-temperature limit. Six classes on a shared logit axis: teacher logit and student logit per row, with the gradient pull on the student. Raising T makes the pull chase every teacher logit in proportion to the gap (pure logit matching, cosine → 1.00) and shrinks the entire field as 1/T². Lowering T leaves the very-negative "4" with almost no pull: distillation ignores the noisy logit.

The temperature does more than soften. Set it high and you recover logit matching, attending equally to every logit; set it moderate and you keep the useful similarity structure while dropping the teacher's least reliable opinions. The right setting depends on how much the student can hold.

Transferring how to generalize

The claim that soft targets transfer how a model generalizes, beyond its raw accuracy, gets its first test on MNIST. Train a large net (two hidden layers of 1200 units, with dropout and input jitter) and it makes 67 test errors. Train a small net (two hidden layers of 800 units, no regularization) the ordinary way and it makes 146. Now take that same small net, still with no regularization of its own, and add one thing only: a term that matches the large net's soft targets at $T=20$. It falls to 74 errors. Soft targets alone, with no jitter and no dropout, did most of the work of regularizing the small net, including transferring knowledge about how to handle translated digits even though the transfer set contained no translations.

The sharpest test removes a digit entirely. Take every example of the digit 3 out of the transfer set, so the student is never once given a 3 as a target. It still classifies test 3s correctly. Walk through why, because this one experiment carries the argument in miniature. The student never sees a 3, but it sees many 2s, 5s, and 8s whose soft targets carry a little 3 probability, because those digits genuinely resemble a 3. From that leakage alone it learns the shape of a 3 well enough to recognize one. Only the output bias for class 3 is missing, since nothing in training ever made 3 the answer to push that bias up. Left where training put it, that bias gives 206 errors, 133 of them on the 1010 test 3s. Add 3.5 to it, a single scalar the paper chose to optimize overall test-set performance rather than computed, and the student makes 109 errors, only 14 on 3s: 98.6% of the test 3s correct, for a digit it was never shown. The knowledge of what a 3 looks like was there all along, transferred through other digits; only one scalar was miscalibrated.

The same calibration idea runs in reverse when a class is over-represented. Build the transfer set from only the 7s and 8s of the training set and the student over-predicts them, making 47.3% test errors; lower the 7 and 8 biases by 7.6 to undo the skew and that falls to 13.2%. Increasing an under-represented class's bias and decreasing an over-represented one's are the same correction pointed two ways.
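The correction itself is a one-line shift on a logit. The sketch below uses invented logits for a single image of a 3 from a student whose class-3 bias is too low; `predict` is a hypothetical helper, and only the 3.5 offset and the 14-of-1010 error count come from the paper.

```python
def predict(logits, bias_shift=None):
    """Return the argmax class after adding per-class offsets to the logits.

    Adding a constant to a class's output bias adds the same constant to its logit.
    """
    shifted = list(logits)
    for cls, delta in (bias_shift or {}).items():
        shifted[cls] += delta
    return max(range(len(shifted)), key=lambda i: shifted[i])

# Hypothetical student logits for an image of a 3 (classes 0..9).
# The class-3 score is competitive but held down by a miscalibrated bias.
logits = [0.1, -1.0, 2.9, 1.2, -0.5, 2.0, -2.0, 0.3, 2.5, -0.7]

before = predict(logits)             # the uncorrected student misreads it as a 2
after = predict(logits, {3: 3.5})    # the paper's +3.5 on the class-3 bias
print(before, after)

# Arithmetic behind the reported figure: 14 errors on 1010 test 3s.
accuracy_on_3s = (1010 - 14) / 1010
print(round(100 * accuracy_on_3s, 1))
```

The same helper with a negative offset, for example `{7: -7.6, 8: -7.6}`, expresses the correction for the over-represented 7s and 8s.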

Matching the ensemble, and learning from 3% of the data

MNIST is a toy; the speech experiment is a production system. The acoustic model is a real one, close to the version Android voice search used at the time: 8 hidden layers of 2560 units feeding a softmax over 14,000 HMM-state labels, about 85 million parameters, trained on roughly 700 million frames from 2000 hours of speech. The single baseline reaches 58.9% frame accuracy and a 10.9% word error rate. Frame accuracy scores each tiny audio slice against its HMM-state label, so it looks low on its own; the language model downstream fixes up many of those slips, so the word error rate, the number a user actually feels, tracks quality better. Train ten copies from different random seeds and average them, and the ensemble lifts that to 61.1% frame accuracy and 10.7% word error.

Then distill the ensemble into a single model of the same size as the baseline. It reaches 60.8% frame accuracy and the same 10.7% word error: more than 80% of the ensemble's frame-accuracy gain, recovered in one network that costs no more to run than the baseline did. The expensive part, training ten models, stays in the lab, and the deployed model is a single net that happens to know what the ensemble knew, the paper's promise landing on a system people actually used.
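The "more than 80%" figure is quick to verify from the three frame accuracies quoted above. A minimal check, with variable names chosen for illustration:

```python
# Frame accuracies (percent) from the speech experiment described above.
baseline = 58.9
ensemble = 61.1
distilled = 60.8

ensemble_gain = ensemble - baseline      # what ten averaged models buy
distilled_gain = distilled - baseline    # what the single distilled model keeps
recovered = distilled_gain / ensemble_gain

print(f"recovered fraction of the ensemble's gain: {recovered:.1%}")
```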

The same speech model also gives the cleanest demonstration that soft targets regularize. Retrain it on only 3% of the data, about 20 million frames. With hard labels it overfits hard: training accuracy climbs to 67.3% while test accuracy peaks at 44.5% and then falls, so you have to stop early to catch the peak. Train the same model on the same 3% with soft targets from the full-data teacher and it converges smoothly to 57.0% test, no early stopping needed, within about two points of the 58.9% you get from all the data. The tell that this is regularization rather than the soft model simply fitting less: its training accuracy, 65.4%, is about the same as the hard model's 67.3%. Both fit the small set well; only the soft one generalizes off it, because the teacher's soft targets smuggle in the regularities of the 97% of the data this model never saw. Press play and watch the hard run climb to its peak and turn down while the soft run keeps rising:

Figure 3 · soft targets as a regularizer. Test accuracy over training on 3% of the speech data. The hard-target run peaks at 44.5% then overfits downward (early-stopped), while its train accuracy keeps climbing to 67.3%. The soft-target run converges to 57.0%, close to the 58.9% from all the data, with no early stopping. Endpoints are the paper's Table 5; the trajectories between are illustrative.

One way to read the paper: a soft target is a channel for shipping the regularities a model discovered on a huge dataset to another model that will only ever see a sliver of it. The bits that a hard label cannot hold are exactly the bits that say how to generalize.

Specialists for a 15,000-class problem

The last part of the paper turns the ensemble idea around. On a dataset like JFT (an internal Google set of 100 million images across 15,000 labels), even training a full ensemble is out of reach: the baseline net took about six months to train, and you cannot wait years to average several of them. The question becomes how to get an ensemble's benefit without an ensemble's training cost.

The extra models can be small and narrow instead. Keep one generalist trained on everything, then add many specialists, each trained only on a cluster of classes that the generalist tends to confuse (different breeds of mushroom, different kinds of bridge). A specialist does not need a 15,000-way softmax; it keeps its few hundred special classes and collapses everything else into a single dustbin class. That makes it tiny and fast, and because it only has to separate classes that already look alike, it sharpens exactly where the generalist is vague. The figure shows the split: the generalist spreads its bet across a confusable cluster, while a specialist trained on that cluster commits:

Figure 4 · generalist and specialists
On a confusable cluster the generalist is unsure which member it is. A specialist trained only on that cluster, plus one wide dustbin for the other 14,700-odd classes, makes the fine call. Switch clusters with the buttons. Cluster themes are the paper's Table 2; probabilities are illustrative.

Several design choices make this cheap and parallel. Each specialist is initialized from the generalist's weights, so it inherits all the low-level feature detectors and only has to refine, then it is trained on a mix of half its special classes and half random examples from everything else. That mix over-represents the special classes, so the trained model comes out biased toward them. Since adding a constant to a logit multiplies that class's score by $e$ raised to the constant before the softmax normalizes, adding the log of the oversampling factor to the dustbin's logit scales the dustbin back up by exactly that factor and undoes the skew. The clusters themselves are found without any labels: run K-means on the columns of the covariance matrix of the generalist's predictions. Each class gets a column recording how its predicted probability moves together with every other class's across images, so two classes the generalist keeps hedging between have similar columns and land in the same cluster. (Section 7 of the paper loosely calls this the confusion matrix, but the method section is explicit that it is the covariance of predictions, chosen precisely because it needs no true labels.) Because no specialist depends on any other, they train completely independently, in days instead of the weeks a real ensemble would take.
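The dustbin correction is worth seeing in numbers. In this sketch the specialist has two special classes plus a dustbin; the logits and the oversampling factor of 25 are invented, and `correct_dustbin` is a hypothetical helper name.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def correct_dustbin(logits, oversampling, dustbin=-1):
    """Add log(oversampling) to the dustbin logit to undo the training skew."""
    out = list(logits)
    out[dustbin] += math.log(oversampling)
    return out

# Hypothetical specialist logits: [special class A, special class B, dustbin].
logits = [2.0, 1.0, 0.5]
k = 25.0   # assumed factor by which the special classes were oversampled

before = softmax(logits)
after = softmax(correct_dustbin(logits, k))

# Odds of dustbin against class A, before and after the correction.
odds_before = before[-1] / before[0]
odds_after = after[-1] / after[0]
print(odds_after / odds_before)   # the dustbin's odds grow by the factor k
```

The ratio between the two special classes is untouched, so the specialist's fine-grained call inside its cluster does not change; only its willingness to say "none of mine" does.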

At test time you combine them in two steps. First the generalist names its single top class; then you wake only the specialists whose cluster contains that class, the active set. To merge the generalist's opinion with the active specialists', you look for the one full distribution $q$ closest to all of them at once, minimizing a sum of KL divergences:

$$\min_{q}\; D_{\mathrm{KL}}(p^g \,\|\, q) + \sum_{m \in A_k} D_{\mathrm{KL}}(p^m \,\|\, q) \tag{5}$$

Here $p^g$ is the generalist's distribution and each $p^m$ is an active specialist's. A specialist has no opinion about classes outside its cluster; everything else it just calls "dustbin." So to compare $q$ against it, you pool all of $q$'s mass on those outside classes into one dustbin number and match that. There is no closed-form answer in general, so you run gradient descent on the logits of $q$ for each image, at $T=1$. When every model emits one number per class, the optimum of this forward KL is the plain average of everyone's probabilities, a vote where each model gets one ballot. (Writing the KL the other way around would multiply the distributions instead, a geometric mean where one confident model can veto a class; the paper takes the gentler average.) On JFT, adding 61 such specialists lifts top-1 test accuracy from 25.0% to 26.1%, a 4.4% relative gain, and the more specialists cover a given class the larger the gain, which is encouraging precisely because specialists are so easy to add in parallel.
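A toy version of that merge fits in a few lines. This is a sketch under stated assumptions, not the paper's implementation: the paper only says to run gradient descent on the logits of the merged distribution at temperature 1, while the gradient expression, the `combine` function, its tuple format for specialists, the step count, and the learning rate are all choices made here for illustration. The probabilities are invented.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def combine(generalist, specialists, num_classes, steps=3000, lr=0.3):
    """Minimize Eq (5) by gradient descent on the logits of q = softmax(z).

    generalist: full distribution over num_classes.
    specialists: list of (special_classes, probs, dustbin_prob), where probs
    maps each special class index to the specialist's probability for it.
    """
    z = [0.0] * num_classes
    for _ in range(steps):
        q = softmax(z)
        # Generalist term: d KL(p^g || q) / dz_k = q_k - p^g_k.
        grad = [qk - pk for qk, pk in zip(q, generalist)]
        for special, probs, p_dust in specialists:
            # Pool q's mass outside the cluster into one dustbin number.
            q_dust = sum(q[k] for k in range(num_classes) if k not in special)
            for k in range(num_classes):
                # Inside the cluster the target is the specialist's probability;
                # outside, the dustbin mass is shared in proportion to q.
                target = probs[k] if k in special else p_dust * q[k] / q_dust
                grad[k] += q[k] - target
        z = [zk - lr * gk for zk, gk in zip(z, grad)]
    return softmax(z)

p_g = [0.4, 0.3, 0.2, 0.1]   # hypothetical generalist over four classes

# Case 1: a second model with an opinion on every class (empty dustbin).
full = ({0, 1, 2, 3}, {0: 0.1, 1: 0.5, 2: 0.2, 3: 0.2}, 0.0)
print(combine(p_g, [full], 4))   # approaches the arithmetic mean of the two

# Case 2: a specialist on classes {0, 1} that lumps the rest into its dustbin.
spec = ({0, 1}, {0: 0.7, 1: 0.2}, 0.1)
print(combine(p_g, [spec], 4))
```

In the first case the result should land on the plain average, matching the closed form mentioned above. In the second, the specialist sharpens the split between classes 0 and 1 while the generalist alone decides how the remaining mass divides between classes 2 and 3.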

This resembles a mixture of experts but avoids the thing that makes mixtures hard to scale. A mixture trains a gating network jointly with the experts, so the experts and the gate keep changing what they ask of each other, and the training is hard to parallelize. Here there is no learned gate. The "gate" is the frozen generalist's top-1 lookup, fixed before any specialist trains, so the specialists can be trained alone and in parallel.

What distillation became

The reframing outlasts the experiments. A trained network's knowledge is not its weights and not the labels it outputs; it is the full distribution it places over every answer, and that distribution is portable. Soften it with a temperature to expose the structure, train a small model to match it, and you can carry an ensemble's accuracy, or a giant model's generalization, into something you can actually deploy. The same move unified the older compression work as a single limit and turned a 15,000-class problem into a generalist plus a swarm of cheap, independent specialists.

The paper ends on a loop it did not close: it had not yet distilled the specialists back into one big net, and Section 6.1 floats, as work in progress, the idea of regularizing those specialists with soft targets too. The broader loop did close, repeatedly. The soft-target idea is now the standard way to shrink models, and it kept generalizing into new domains. It distills black-box language models like GPT-4 through an aligned proxy in the Proxy-KD work, and it compresses many-step diffusion models down to a handful of steps in score-regularized consistency distillation. The temperature softmax and the soft-target loss in this short 2015 paper are the common root of all of it.

Provenance: verified against primary literature
Hinton, Vinyals, Dean (2015): The distillation method, temperature softmax, and all experiments.
Buciluǎ, Caruana, Niculescu-Mizil (2006): Model compression: train a small net on an ensemble’s predictions.
Ba & Caruana (2014): The explicit squared-error logit matching that distillation generalizes.
Hinton et al. (2012): Dropout, the regularizer the MNIST teacher relies on.
Correction: The logit-matching trick is usually credited to the 2006 Model Compression paper, but that paper trained on the ensemble's predictions, not its logits. Explicit logit regression with squared error is Ba & Caruana (2014). Distillation's high-temperature limit (Eq 4) subsumes both.

Questions you might still have

Isn’t raising the temperature just blurring the labels, like label smoothing?
No. Label smoothing puts the same uniform floor on every wrong class, the same for every image. Soft targets are non-uniform and input-dependent: which wrong classes get probability is exactly what this image resembles. A trained teacher supplies that structure; temperature only makes it visible to the loss.

Why multiply the soft-target loss by T²?
Soft-target gradients shrink as 1/T² as you raise the temperature. If you mix soft and hard losses without correcting for that, the hard-label term takes over whenever T is large. Multiplying the soft term by T² holds the two contributions in roughly the same scale so you can change T freely.

How can the distilled model classify a digit it never saw?
When the transfer set has no 3s, the model never gets a 3 as a target, but 3-ness leaks into the soft targets of other digits (a 2 that looks like a 3 carries a little 3 probability). It learns the shape of a 3 from that leakage. Only one number is left wrong: the output bias for class 3, which nothing pushed up. Add 3.5 to it and 98.6% of test 3s are correct.

Did earlier work already match logits?
Caruana and collaborators showed an ensemble can be compressed into one small net. The 2006 Model Compression paper trained the small net on the ensemble’s predicted labels over a large transfer set; the explicit recipe of regressing the pre-softmax logits with squared error is Ba & Caruana, 2014. Distillation contains both: matching logits is its high-temperature limit.

Does modern knowledge distillation descend from this paper?
Directly. The same soft-target idea now distills black-box LLMs through an aligned proxy (the Proxy-KD explainer) and compresses many-step diffusion models into a few steps (the score-regularized consistency explainer). The temperature softmax and the soft-target loss in this 2015 paper are the common ancestor.

Footnotes & further reading

  1. The paper: Hinton, Vinyals, Dean, Distilling the Knowledge in a Neural Network (Google, NIPS 2014 Deep Learning Workshop; arXiv March 2015).
  2. The prior model-compression result: Buciluǎ, Caruana, Niculescu-Mizil, Model Compression (KDD 2006). It trains the small model on the ensemble's predicted labels over a large transfer set.
  3. The explicit logit-matching objective: Ba, Caruana, Do Deep Nets Really Need to be Deep? (2014), which regresses the small net's pre-softmax logits onto the teacher's with a squared error.
  4. The popular name "dark knowledge" comes from a Hinton talk, Dark Knowledge (TTIC, 2014), not the paper, which calls it a "rich similarity structure."
  5. Dropout as an implicit ensemble: Hinton, Srivastava, Krizhevsky, Sutskever, Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors (2012).
  6. Descendants in this library: distilling a black-box LLM through an aligned proxy in Proxy-KD, and few-step diffusion distillation in score-regularized consistency models.