Knowledge Distillation of Black-Box Large Language Models
You can't see inside GPT-4. So align an open model to it, and learn from the open model instead.
A black-box teacher gives you only its text. Proxy-KD trains an open model to imitate that teacher, then distills the open model's full output distribution into a small student.
GPT-4 will answer your prompt, but it will not show you its work: no probabilities, no internals, just the words it picked. That missing distribution is exactly what a small model learns fastest from.
Suppose you want a small open model, something you can run cheaply on your own hardware, to reason as well as GPT-4. The obvious move is knowledge distillation: have the big model teach the small one. It has worked for years. But the standard recipe assumes you can look inside the teacher, read the probability it assigns to every word, and have the student match those probabilities. With a proprietary model behind an API, you can't. You send a prompt and get back text. The numbers that make distillation work are sealed.
Proxy-KD, from Sun Yat-sen University and Alibaba, gets around the seal with one idea: put a second open model, a proxy, between the teacher and the student. First align the proxy to imitate the black-box teacher. Now the proxy is a stand-in whose insides you can read, so the student distills from it the way the textbook recipe describes. The student gets the black box's quality through a window the black box never opened.
A few ideas carry the method: what a soft label knows that a hard one doesn't, why a black-box teacher can only give hard ones, how an open proxy is aligned to the teacher with preference optimization, and how the student learns from the proxy while trusting it more on some examples than others. None is heavy on its own.
What a soft label knows that a hard one doesn't
Start with the thing distillation transfers, because it isn't the answer. It's the shape of the answer. Hinton, Vinyals and Dean made the point in 2015: a one-hot label says "the answer is Paris" and nothing else, but a trained model's full output distribution says "it's Paris, and if not Paris then probably Lyon or Lille, and almost certainly not banana." Those small probabilities on the wrong answers encode the similarity the model places between options, and a student that matches them learns far more per example than one that only learns which option was correct.[1]
Make it concrete on a single next-token step. For the prompt "the capital of France is", an open model puts most of its mass on Paris but spreads the rest in a telling way: the other French cities (Lyon, Lille, Nice) carry more probability than other capitals (London, Berlin), which carry more than random words. A hard label is the single token you observe. The soft distribution is the whole curve behind it, and the curve says which mistakes are near-misses.
Temperature makes the difference visible. At low temperature $T$ the distribution collapses to the one token a hard label gives you, and its entropy falls toward zero: nothing left to learn but the answer. Raise $T$ and the runner-up structure appears, the information a student can use.
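To see the collapse in numbers, here is a minimal sketch of a temperature-scaled softmax and its entropy. The six logits are made up for illustration and are not taken from any real model; only their ordering (Paris, then French cities, then other capitals, then a random word) follows the example above.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for the next token after "the capital of France is".
tokens = ["Paris", "Lyon", "Lille", "London", "Berlin", "banana"]
logits = [9.0, 5.0, 4.5, 3.0, 2.5, -2.0]

for temperature in (0.1, 1.0, 3.0):
    probs = softmax(logits, temperature)
    top = ", ".join(f"{t}={p:.3f}" for t, p in zip(tokens[:3], probs))
    print(f"T={temperature}: entropy={entropy(probs):.3f} nats | {top}")
```

At the lowest temperature the entropy is numerically zero and only Paris survives; at the highest, the ordering of the runners-up becomes readable.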
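Distillation, in its richest form, has the student match that whole curve. Written as a loss, it is a Kullback-Leibler divergence between the teacher's distribution $\pi_t$ and the student's $\pi_s$, summed over the tokens of the answer $y$ to a prompt $x$:

$$\mathcal{L}_{\mathrm{KL}} = \sum_{n=1}^{|y|} \mathrm{KL}\big(\pi_t(\cdot \mid x, y_{<n}) \,\|\, \pi_s(\cdot \mid x, y_{<n})\big) \tag{2}$$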
Read it as: wherever the teacher places probability, push the student to place it too. The teacher sits on the left of the divergence, which is the forward direction and matters later: forward KL penalizes the student for missing any token the teacher supports, so the student spreads to cover the teacher's options rather than collapsing onto its single favorite. (That covering behavior is general knowledge about which way you write the KL, not a claim the paper proves; reverse-KL methods, with the student on the left, instead let it commit to one mode.) Everything in Proxy-KD is built to make this equation usable when the teacher won't hand you its distribution $\pi_t$.
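The covering-versus-committing contrast is easy to check on a toy case. The sketch below uses three made-up distributions over three tokens: a teacher with two modes, a student that covers both, and a student that collapses onto one. None of these numbers come from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over three candidate tokens.
teacher = [0.49, 0.49, 0.02]    # two plausible answers
covering = [0.45, 0.45, 0.10]   # student spreads over both modes
collapsed = [0.96, 0.02, 0.02]  # student commits to one mode

print(f"forward KL, covering student : {kl(teacher, covering):.3f}")
print(f"forward KL, collapsed student: {kl(teacher, collapsed):.3f}")
print(f"reverse KL, collapsed student: {kl(collapsed, teacher):.3f}")
```

Forward KL charges the collapsed student heavily for the mode it dropped, while reverse KL, with the student on the left, is much more forgiving of the same collapse.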
A black box gives only text
The obstacle sits in equation (2): it needs $\pi_t(\cdot \mid x, y_{<n})$, the teacher's probability for every candidate token, and GPT-4 will not give you that. It returns a string. So the only thing you can do with a black-box teacher's output is treat its text as ground truth and have the student $\pi_s$ copy it, token by token, by maximizing likelihood:

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{n=1}^{|y|} \log \pi_s(y_n \mid x, y_{<n}) \tag{1}$$
This is ordinary supervised fine-tuning. It is also exactly equation (2) with the teacher distribution replaced by a one-hot spike on the token it happened to emit. You have thrown away the entire curve and kept the single sample. That is the gap between black-box and white-box distillation, and it is the reason small models trained on GPT-4 text (Alpaca, Vicuna, Orca) plateau where they do: they learn the teacher's answers without ever seeing the teacher's uncertainty.
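The claim that fine-tuning is distillation against a one-hot teacher can be verified at a single token position. In this sketch the student distribution and the emitted token are arbitrary illustrative values.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms where p is zero contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical student distribution over three tokens at one position.
student = [0.7, 0.2, 0.1]
emitted = 1  # index of the token the teacher happened to emit

one_hot = [1.0 if i == emitted else 0.0 for i in range(len(student))]
nll = -math.log(student[emitted])

print(f"NLL of the emitted token : {nll:.6f}")
print(f"KL(one-hot || student)   : {kl(one_hot, student):.6f}")
```

The two numbers coincide because a one-hot distribution has zero entropy, so the KL reduces to the negative log-probability of the single observed token.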
There is a second, quieter problem, and it bites even if you somehow had the teacher's distribution. Distillation degrades when the teacher is far more capable than the student. Cho and Hariharan showed in 2019 that a more accurate teacher can be a worse teacher: the gap in capacity is too wide for the small model to follow. GPT-4 to a 7B model is exactly that kind of gap. So you want something that bridges it, an intermediate model closer to the student's level. TAKD proposed precisely this, a "teacher assistant" in the middle, but in a white-box vision setting and with no step to make the assistant match the teacher. Hold that omission; it decides everything.
Slip an open proxy between them
Proxy-KD's move is to insert an open model, the proxy $\pi_p$, and solve both problems at once. Because the proxy is open, its full distribution is readable, so the student can do real soft-label distillation against it. And because the proxy is chosen to sit between the teacher and the student in capability, it softens the capacity gap.
One detail surprises people: the proxy is larger than the student, not smaller. In the paper the teacher is GPT-4, the proxy is Llama-2-70B, and the student is Llama-2-7B. The proxy has to be capable enough to imitate GPT-4 closely; a weak proxy imitates it badly and passes that badness on. The paper confirms this directly: a 70B proxy aligns better with the teacher, and yields a better student, than a 13B proxy. The proxy is a high-capacity stand-in for an even-higher-capacity teacher, well above the student it serves.
The pipeline runs in three stages over three slices of data. The training set is one million GPT-4 responses (from OpenOrca and Nectar, plus synthetic data on benchmark training sets), cut into 10% for warming up the proxy, 45% for aligning the proxy to the teacher, and 45% for distilling into the student. Stages 1 and 2 pull the proxy onto the teacher, and only in stage 3 does the student learn, from the teacher's text and the aligned proxy's distribution at once.
Figure: the three-stage pipeline. Stages 1 and 2 pull the proxy onto the teacher: copy its text (NLL) and prefer the teacher's replies over the proxy's own (DPO).
The key word is aligned. An open proxy is only useful as a teacher stand-in if its distribution actually matches what the black box would have produced. Drop in a raw Llama-2-70B and distill from it, and you are distilling from the wrong teacher. So the first half of the method is one job: make the proxy a faithful echo of GPT-4.
Align the proxy: SFT, then DPO
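Alignment runs in two passes. First a warm-up and a round of plain fine-tuning on the teacher's text, the same NLL as equation (1) but applied to the proxy $\pi_p$:

$$\mathcal{L}_{\mathrm{NLL}}^{\mathrm{proxy}} = -\sum_{n=1}^{|y|} \log \pi_p(y_n \mid x, y_{<n}) \tag{3}$$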
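Copying text gets the proxy into the right neighborhood, but it leaves a residual mismatch: trained only to maximize the likelihood of teacher answers, the proxy still puts mass on replies the teacher never would. The second pass closes that gap by comparison rather than imitation. For each prompt $x$, sample the proxy's own reply $\hat{y}$ and pair it against the teacher's reply $y$, then train the proxy to prefer the teacher's. That is exactly the setup Direct Preference Optimization was built for: turn "$y$ is better than $\hat{y}$" into a single differentiable loss, no reward model and no reinforcement-learning loop. With $\pi_{\mathrm{ref}}$ a frozen reference copy of the proxy:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \left[\log \frac{\pi_p(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \log \frac{\pi_p(\hat{y} \mid x)}{\pi_{\mathrm{ref}}(\hat{y} \mid x)}\right]\right) \tag{4}$$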
Walk through it. Each log ratio measures how much more likely the current proxy makes a reply than a frozen reference does. The teacher's reply is the "chosen" one and its ratio should go up; the proxy's own reply is the "rejected" one and its ratio should go down. The difference of the two is a margin; $\beta$ sets how hard the loss leans on it, which in DPO is the strength of the leash keeping the proxy near its reference. The $-\log \sigma$ is minimized when that margin is large and positive, that is, when the proxy has moved decisively toward the teacher's reply and away from its own.
One sign here is worth pinning down, because the paper prints it the other way. Standard DPO is a negative log-sigmoid loss, $-\log \sigma(\cdot)$, and Proxy-KD minimizes it alongside the NLL. The published equation (4) drops the minus and writes $\log \sigma(\cdot)$. Minimizing $\log \sigma$ of the margin would push the margin the wrong way, toward preferring the proxy's own reply over the teacher's, the exact opposite of the stated goal. So the minus belongs there; we keep it, and so does any correct implementation. (You can tell from the paper's own results that the code is right: after alignment the proxy's top-1 token agreement with the teacher rises, and dropping this preference step costs the student real points.)
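The sign argument can be checked numerically. The sketch below evaluates the preference loss for three margins, once with the minus and once without. The sequence log-probabilities and the value of `beta` are hypothetical, chosen only to make the margins easy to read.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta):
    """Negative log-sigmoid of the scaled margin between the two log ratios."""
    margin = (lp_chosen - ref_chosen) - (lp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Hypothetical sequence log-probs; the reference scores both replies at -12.
beta = 1.0  # assumed value for illustration
for lp_chosen in (-16.0, -12.0, -8.0):
    with_minus = dpo_loss(lp_chosen, -12.0, -12.0, -12.0, beta)
    without_minus = -with_minus  # the sign as printed: log sigmoid(beta * margin)
    print(f"margin={lp_chosen + 12.0:+.0f}  "
          f"-log sigmoid: {with_minus:.3f}  log sigmoid: {without_minus:.3f}")
```

With the minus, the loss falls as the margin grows, so minimizing it favors the teacher's reply. Without it, the smallest value sits at the most negative margin, which is the failure described above.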
Two design choices set this apart from textbook DPO, and both come from the iterative loop. The reference is not a fixed checkpoint but the proxy from the previous round, $\pi_p^{(i-1)}$, re-anchored in each of the $k = 16$ iterations, so the proxy keeps chasing a moving target it just left behind. And the rejected reply is drawn fresh from the current proxy each round, on-policy, rather than from a static dataset. That online sampling is the dominant cost of the method, the price of keeping the comparison honest. The proxy update is the sum of the copy loss of equation (3) and the preference loss of equation (4):

$$\mathcal{L}_{\mathrm{proxy}} = \mathcal{L}_{\mathrm{NLL}}^{\mathrm{proxy}} + \mathcal{L}_{\mathrm{DPO}} \tag{6}$$
Figure: the proxy converging on the teacher. The teacher's next-token distribution is a fixed outline; the proxy's bars start with their mass in the wrong place (its own preferred reply, the rejected token) and are pulled onto the teacher as the iterations advance. The agreement between the two climbs but stops short of 100%: even a well-aligned proxy is not a perfect copy, and the next stage is built to handle that residual gap.
```python
# proxy alignment, one DPO round (ref = the proxy from the last round)
# pseudocode: lp(m, y, x) = log prob of y under model m
ref = freeze(proxy)                            # pi_p^(i-1), the reference
for x, y in D_p:                               # y: the teacher's reply = chosen
    y_hat = proxy.sample(x)                    # the proxy's own reply = rejected
    nll = -lp(proxy, y, x)                     # Eq 3: copy the teacher's text
    win = lp(proxy, y, x) - lp(ref, y, x)
    lose = lp(proxy, y_hat, x) - lp(ref, y_hat, x)
    pref = -log(sigmoid(beta * (win - lose)))  # Eq 4, with the minus
    opt.zero_grad()                            # clear gradients from the last step
    (nll + pref).backward(); opt.step()        # Eq 6
# repeat for k = 16 rounds
```

After 16 rounds the proxy is a strong, GPT-4-flavored model in its own right (the aligned 70B proxy scores around 87 on ARC and 78 on GSM8K, well above any 7B student), and crucially its distribution is one you can read. The teacher's knowledge now lives in an open model. The student can finally have the soft labels equation (2) calls for.
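The loop above can be exercised end to end on a toy problem. In this sketch the "proxy" is not a language model but a categorical distribution over four candidate replies to a single prompt, parameterized by logits and updated with hand-derived gradients. The 16 rounds follow the text; the starting logits, `beta`, `lr`, and the number of steps per round are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def align_round(logits, teacher_idx, steps, beta, lr, rng):
    """One round: freeze the current proxy as the reference, then take NLL + DPO steps."""
    ref_logp = [math.log(p) for p in softmax(logits)]  # frozen reference
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        logp = [math.log(p) for p in probs]
        # On-policy rejected reply, sampled from the current proxy.
        rejected = rng.choices(range(len(probs)), weights=probs)[0]
        win = logp[teacher_idx] - ref_logp[teacher_idx]
        lose = logp[rejected] - ref_logp[rejected]
        # Derivative of -log sigmoid(m) with respect to m = beta * (win - lose).
        slope = -(1.0 - sigmoid(beta * (win - lose)))
        for j in range(len(logits)):
            is_chosen = 1.0 if j == teacher_idx else 0.0
            is_rejected = 1.0 if j == rejected else 0.0
            grad_nll = probs[j] - is_chosen
            grad_pref = slope * beta * (is_chosen - is_rejected)
            logits[j] -= lr * (grad_nll + grad_pref)
    return logits

rng = random.Random(0)
teacher_idx = 0                  # the reply the teacher gave
logits = [0.0, 2.0, 1.0, 0.5]    # hypothetical start: proxy prefers reply 1
p_before = softmax(logits)[teacher_idx]
for _ in range(16):              # reference re-anchored at the start of every round
    logits = align_round(logits, teacher_idx, steps=5, beta=0.5, lr=0.3, rng=rng)
p_after = softmax(logits)[teacher_idx]
print(f"proxy probability of the teacher's reply: {p_before:.3f} -> {p_after:.3f}")
```

Both gradient terms raise the teacher reply's logit relative to every other reply, so its probability can only go up from round to round. A real implementation would compute the same two terms over token sequences with automatic differentiation.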
Distill the aligned proxy
Student training combines the two sources. The hard labels stay: the student still does plain NLL on the teacher's text, equation (1) restricted to the distillation slice $D_s$. On top of that, the soft labels arrive from the aligned proxy. Because the proxy is open, equation (2) finally typechecks, with the proxy $\pi_p$ standing in for the unreachable teacher:

$$\mathcal{L}_{\mathrm{KL}} = \sum_{n=1}^{|y|} \mathrm{KL}\big(\pi_p(\cdot \mid x, y_{<n}) \,\|\, \pi_s(\cdot \mid x, y_{<n})\big) \tag{8}$$
Same forward KL, proxy on the left, so the student is pushed to cover every token the proxy thinks plausible. In practice the proxy's distribution is computed once, offline, and only the top-10 token logits per step are stored, since almost all the probability mass sits on a handful of tokens anyway. That keeps the soft labels cheap to carry through training. The student now learns the answer (from the teacher's text) and the shape of the answer (from the proxy's curve) together, the reason for recovering the soft labels in the first place.
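Here is one way the top-10 caching could look, as a minimal sketch. The text says only that the top-10 logits are stored; renormalizing them into a distribution over those ten tokens is an assumption of this sketch, and the 1,000-token vocabulary and its logits are invented for illustration.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_soft_labels(logits, k=10):
    """Keep the k largest logits and renormalize them into a distribution."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    return dict(zip(keep, probs))  # token id -> cached soft-label probability

def retained_mass(logits, k=10):
    """Share of the full distribution's mass that sits on the top-k tokens."""
    full = softmax(logits)
    return sum(sorted(full, reverse=True)[:k])

# Hypothetical peaked logits over a 1,000-token vocabulary.
logits = [8.0, 6.0, 5.5, 5.0, 4.0] + [-3.0] * 995
cached = top_k_soft_labels(logits, k=10)
print(f"cached entries: {len(cached)}, mass retained: {retained_mass(logits, k=10):.3f}")
```

For a peaked distribution like this one, ten entries per position keep nearly all of the probability mass while shrinking storage from the vocabulary size to a constant.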
Trust the proxy sample by sample
The proxy is aligned but imperfect, and the imperfection is uneven: on some prompts it mirrors the teacher closely, on others it drifts. Distilling its distribution blindly would teach the student the proxy's mistakes on exactly the prompts where the proxy is least trustworthy. So Proxy-KD weights each example by how faithfully the proxy stands in for the teacher there.
The measure of faithfulness is already in hand: $\log \pi_p(y \mid x)$, the proxy's log-likelihood of the teacher's reply $y$. If the proxy would readily have produced the teacher's answer, it assigns that answer high probability, so the two agree on this example. Standardize that quantity across the distillation set (subtract the mean $\mu$, divide by the standard deviation $\gamma$) to get a z-score, then squash it through a sigmoid into a weight in $(0, 1)$:

$$w(x, y) = \sigma\!\left(\frac{\log \pi_p(y \mid x) - \mu}{\gamma}\right) \tag{9}$$
The standardization divides by $\gamma$, the standard deviation, not by the variance $\gamma^2$, which appears only in its definition. The sigmoid does the rest: an example where the proxy models the teacher's reply one standard deviation better than average gets $z = +1$ and weight $\sigma(1) \approx 0.73$; one standard deviation worse gets $z = -1$ and weight $\sigma(-1) \approx 0.27$; the average-likelihood example lands exactly at $0.5$.
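The weighting is a few lines once the proxy's log-likelihoods are in hand. This sketch uses five invented sequence log-likelihoods; using the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def trust_weights(seq_logps):
    """Per-sample weight: sigmoid of the z-scored proxy log-likelihood."""
    mu = statistics.fmean(seq_logps)
    gamma = statistics.pstdev(seq_logps)  # assumption: population std, nonzero
    return [sigmoid((lp - mu) / gamma) for lp in seq_logps]

# Hypothetical log pi_p(y | x) values for five teacher replies.
seq_logps = [-42.0, -30.0, -25.0, -18.0, -10.0]
for lp, w in zip(seq_logps, trust_weights(seq_logps)):
    print(f"log-likelihood {lp:6.1f} -> weight {w:.2f}")

print(f"z=+1 -> {sigmoid(1.0):.2f}, z=0 -> {sigmoid(0.0):.2f}, z=-1 -> {sigmoid(-1.0):.2f}")
```

The reply with exactly average log-likelihood gets weight 0.5, and the weights rise monotonically with how readily the proxy would have produced the teacher's text.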
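The weight multiplies only the soft term, which gives the weighted KL of equation (10):

$$\mathcal{L}_{\mathrm{wKL}} = w(x, y) \sum_{n=1}^{|y|} \mathrm{KL}\big(\pi_p(\cdot \mid x, y_{<n}) \,\|\, \pi_s(\cdot \mid x, y_{<n})\big) \tag{10}$$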
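The hard-label NLL stays unweighted: even on a low-trust example, the teacher's text is still ground truth, so the student should keep learning it. The full student objective adds the two, with a single knob $\alpha$ setting how much the soft labels count:

$$\mathcal{L}_{\mathrm{student}} = \mathcal{L}_{\mathrm{NLL}} + \alpha \, w(x, y) \, \mathcal{L}_{\mathrm{KL}} \tag{11}$$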
The large $\alpha$ (set to 100) is not a thumb on the scale toward the soft labels so much as a units fix: a per-token KL over a peaked distribution is numerically tiny next to a sequence NLL, so it needs a big multiplier just to register. This weighting is a heuristic, not something the paper derives from a bound, and the ablation later shows it helps on average, though not on every benchmark. All of it assembles into one student step:
```python
# student distillation step (proxy frozen, top-10 logits cached)
# pseudocode: mu, gamma precomputed over D_s
x, y = sample(D_s)                        # y: the teacher's reply (hard label)
nll = -lp(student, y, x)                  # Eq 7: hard-label SFT on teacher text
# soft labels: the aligned proxy's distribution at each token of y
kl = KL(proxy_p(y, x), student_p(y, x))   # Eq 8, forward KL(proxy || student)
z = (lp(proxy, y, x) - mu) / gamma        # how well the proxy models teacher's y
w = sigmoid(z)                            # Eq 9: per-sample trust, in (0, 1)
loss = nll + alpha * w * kl               # Eq 11, alpha = 100
opt.zero_grad()                           # clear gradients from the last step
loss.backward(); opt.step()
```

It beats both kinds of KD
The comparison that matters is against the two families Proxy-KD is trying to combine: white-box KD (which has rich soft labels but only from a weaker open teacher) and black-box KD (which has the strong teacher but only its text). On a Llama-2-7B student across six reasoning and knowledge benchmarks, Proxy-KD averages 56.78, ahead of the best black-box baseline (vanilla SFT on GPT-4 text, 53.66), the best white-box baseline (MiniLLM, 53.87), and a teacher-assistant baseline (TAKD, 52.82). The lead is largest on GSM8K, where it scores 53.07, several points clear of the field; on BBH it scores 53.40, only narrowly edging the next-best methods.
Two caveats. Proxy-KD does not win everywhere: on MMLU it lands at 51.35, a hair behind plain forward-KL distillation, and on CSQA the TAKD baseline edges it. And the gap to the teacher is still large: GPT-4 scores 88 on BBH where the best student reaches the low 50s. This is distillation into a 7B model, not a way to reach GPT-4. The narrower claim holds: given the same small student and the same GPT-4 data, routing the knowledge through an aligned proxy beats every other way of using it.
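To put the headline comparison in one place, this small sketch computes Proxy-KD's margin over each baseline from the four six-benchmark averages quoted above. The dictionary labels are descriptive names chosen here, not identifiers from the paper.

```python
# Six-benchmark averages for the Llama-2-7B student, as quoted in the text.
averages = {
    "Proxy-KD": 56.78,
    "SFT on GPT-4 text (black-box)": 53.66,
    "MiniLLM (white-box)": 53.87,
    "TAKD (teacher assistant)": 52.82,
}

margins = {
    name: round(averages["Proxy-KD"] - score, 2)
    for name, score in averages.items()
    if name != "Proxy-KD"
}

for name, margin in margins.items():
    print(f"Proxy-KD leads {name} by {margin:.2f} points")
```

The margins come to 3.12, 2.91, and 3.96 points on the average, with the smallest lead over the white-box baseline.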
Which piece is load-bearing
The ablations are where the method's logic shows, because they answer the obvious skeptic's question: does the proxy really do anything, or is it an expensive detour? Knock out one component at a time and each one leaves a mark. Remove the proxy entirely and you are back to plain black-box SFT; the student drops 6.72 on BBH and 3.56 on GSM8K, the soft-label signal gone.
The most telling result is the one that keeps the proxy but skips its alignment. Distill from a raw, un-aligned Llama-2-70B and BBH falls 10.40, further than removing the proxy altogether. An unaligned proxy is worse than no proxy: it is a confident teacher pointing the wrong way, and the student follows it off the road. The same lesson is hidden in TAKD's number: its unaligned assistant (52.82) scores below plain black-box KD (53.66). A model in the middle helps only if it has first been made to match the teacher.
The remaining two pieces fall in line. Dropping the DPO preference step, so the proxy is aligned by copying alone, costs the student across the board (ARC −4.98, MMLU −2.56): imitation gets the proxy close, and only the preference step closes the rest. Replacing the weighted KL with a plain one costs 2.60 on AGIEval and 1.88 on MMLU, though it actually raises ARC by 0.72. A heuristic behaves exactly like that: helpful on average, not everywhere.
Each part of the method has one job, and laid end to end they read as a few plain sentences. Soft labels carry more than hard ones, but a black box only gives hard ones, so build an open proxy that gives soft ones, align it carefully (because an unaligned one misleads), distill its distribution into the student, and lean on it harder where it agrees with the teacher. Every loss term in the paper is one of those sentences. Two costs come with that: the alignment phase and its online sampling roughly double the training, and the experiments cover only Llama models, leaving Mistral and Qwen backbones for future work. Align a proxy once, though, and a sealed teacher's distribution becomes something a small open model can learn from.
Questions you might still have
Why not just fine-tune the student on GPT-4 text directly?
That is vanilla black-box KD, the −proxy ablation. It only ever sees the one token the teacher sampled, the hard label, and never the distribution over alternatives. Recovering that distribution from an aligned open proxy is worth 6.72 points on BBH.
If GPT-4 is the teacher, why is the proxy bigger than the student?
The proxy (Llama-2-70B) is the white-box stand-in, not the teacher. It needs enough capacity to imitate GPT-4 closely; a 70B proxy aligns better and yields a better student than a 13B one. It sits between teacher and student in capability, well above the 7B student.
Is this just TAKD with an extra model in the middle?
TAKD inserts an intermediate model but never aligns it to the teacher, and here its unaligned assistant (52.82) scores below plain black-box KD (53.66). The alignment phase, SFT plus iterative DPO, turns a bridge that hurts into one that helps.
Why forward KL and not the reverse KL of MiniLLM?
Forward KL puts the proxy on the left of the divergence, so the student is penalized for missing any token the proxy supports and learns to cover its modes. MiniLLM and GKD use reverse KL, which lets the student commit to one mode. Proxy-KD chooses the covering direction.
What does the sample weight actually buy?
It down-weights examples where the proxy is a poor stand-in for the teacher, measured by the low probability the proxy assigns the teacher reply. It is a heuristic: the ablation shows it helps the average but not every benchmark, raising ARC when removed.
Footnotes & further reading
- [1] The phrase "dark knowledge" for this transferable structure came from Hinton's later talks, not the 2015 paper itself, which says only that soft targets carry "much more information per training case than hard targets." The paper: Hinton, Vinyals, Dean, Distilling the Knowledge in a Neural Network.
- The paper: Chen, Chen, Yi, Quan, Li, Yan, Zhang, Knowledge Distillation of Black-Box Large Language Models (Sun Yat-sen University & Alibaba, 2024).
- The alignment loss is Direct Preference Optimization (Rafailov et al., arXiv:2305.18290), here run iteratively with the reference re-anchored to the previous round's proxy.
- The capacity-gap effect is from Cho & Hariharan, On the Efficacy of Knowledge Distillation, and the teacher-assistant idea from Mirzadeh et al., Improved Knowledge Distillation via Teacher Assistant (TAKD).
- Reverse-KL distillation baselines: MiniLLM (Gu et al., arXiv:2306.08543) and GKD (Agarwal et al., arXiv:2306.13649). In the paper's experiments, both the proxy and the student are trained with LoRA.