Knowledge Distillation of Black-Box Large Language Models

You can't see inside GPT-4. So align an open model to it, and learn from the open model instead.

A black-box teacher gives you only its text. Proxy-KD trains an open model to imitate that teacher, then distills the open model's full output distribution into a small student.

Explaining the paper: Knowledge Distillation of Black-Box Large Language Models · Chen, Chen, Yi, Quan, Li, Yan, Zhang · Sun Yat-sen University · Alibaba · 2024 · arXiv:2401.07013

GPT-4 will answer your prompt, but it will not show you its work: no probabilities, no internals, just the words it picked. That missing distribution is exactly what a small model learns fastest from.

Suppose you want a small open model, something you can run cheaply on your own hardware, to reason as well as GPT-4. The obvious move is knowledge distillation: have the big model teach the small one. It has worked for years. But the standard recipe assumes you can look inside the teacher, read the probability it assigns to every word, and have the student match those probabilities. With a proprietary model behind an API, you can't. You send a prompt and get back text. The numbers that make distillation work are sealed.

Proxy-KD, from Sun Yat-sen University and Alibaba, gets around the seal with one idea: put a second open model, a proxy, between the teacher and the student. First align the proxy to imitate the black-box teacher. Now the proxy is a stand-in whose insides you can read, so the student distills from it the way the textbook recipe describes. The student gets the black box's quality through a window the black box never opened.

A few ideas carry the method: what a soft label knows that a hard one doesn't, why a black-box teacher can only give hard ones, how an open proxy is aligned to the teacher with preference optimization, and how the student learns from the proxy while trusting it more on some examples than others. None is heavy on its own.

What a soft label knows that a hard one doesn't

Start with the thing distillation transfers, because it isn't the answer. It's the shape of the answer. Hinton, Vinyals and Dean made the point in 2015: a one-hot label says "the answer is Paris" and nothing else, but a trained model's full output distribution says "it's Paris, and if not Paris then probably Lyon or Lille, and almost certainly not banana." Those small probabilities on the wrong answers encode how similar the model considers the options, and a student that matches them learns far more per example than one that only learns which option was correct.¹

Make it concrete on a single next-token step. For the prompt "the capital of France is", an open model puts most of its mass on Paris but spreads the rest in a telling way: the other French cities (Lyon, Lille, Nice) carry more probability than other capitals (London, Berlin), which carry more than random words. A hard label is the single token you observe. The soft distribution is the whole curve behind it, and the curve says which mistakes are near-misses.

Drag the temperature below. At low $T$ the distribution collapses to the one token a hard label gives you, and the entropy readout falls toward zero: nothing left to learn but the answer. Raise $T$ and the runner-up structure appears, the information a student can use:

Figure 1 · hard label vs soft distribution

One next-token step. The sampled token (amber) is all a black-box teacher returns. An open model also hands over the rest of the distribution (violet): the near-misses ranked by plausibility. Lower the temperature to collapse the curve to a single token; raise it to reveal the structure. Entropy, in bits, counts how much the curve carries.
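The temperature effect described above can be reproduced in a few lines. This is a minimal sketch, not the figure's actual code: the six logits are hypothetical values chosen to mimic the Paris example, and `softmax_with_temperature` and `entropy_bits` are illustrative helper names.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for: Paris, Lyon, Lille, London, Berlin, banana
logits = [6.0, 3.0, 2.5, 1.5, 1.0, -3.0]

for T in (0.25, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: top prob={probs[0]:.3f}, entropy={entropy_bits(probs):.3f} bits")
```

At the lowest temperature nearly all the mass sits on the first token and the entropy is close to zero; as $T$ grows, the runner-up tokens take visible probability and the entropy rises.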

Distillation, in its richest form, has the student match that whole curve. Written as a loss, it is a Kullback-Leibler divergence between the teacher's distribution and the student's, summed over the tokens of the answer:

$$
\mathcal{L}_{\text{KL}} = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, \mathbb{D}_{\text{KL}}\!\big(\pi_t(y\mid x)\,\|\,\pi_s(y\mid x)\big) \,\big] \tag{2}
$$

Read it as: wherever the teacher $\pi_t$ places probability, push the student $\pi_s$ to place it too. The teacher sits on the left of the divergence, which is the forward direction and matters later: forward KL penalizes the student for missing any token the teacher supports, so the student spreads to cover the teacher's options rather than collapsing onto its single favorite. (That covering behavior is general knowledge about which way you write the KL, not a claim the paper proves; reverse-KL methods, with the student on the left, instead let it commit to one mode.) Everything in Proxy-KD is built to make this equation usable when the teacher won't hand you $\pi_t$.

A black box gives only text

The obstacle sits in equation (2): it needs $\pi_t(y\mid x)$, the teacher's probability for every candidate token, and GPT-4 will not give you that. It returns a string. So the only thing you can do with a black-box teacher's output is treat its text as ground truth and have the student copy it, token by token, by maximizing likelihood:

$$
\mathcal{L}_{\text{NLL}} = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, -\log \pi_s(y\mid x) \,\big] \tag{1}
$$

This is ordinary supervised fine-tuning. It is also exactly equation (2) with the teacher distribution replaced by a one-hot spike on the token it happened to emit: a one-hot distribution has zero entropy, so the KL reduces to $-\log \pi_s$ of that token. You have thrown away the entire curve and kept the single sample. That is the gap between black-box and white-box distillation, and it is the reason small models trained on the text of proprietary models (Alpaca, Vicuna, Orca) plateau where they do: they learn the teacher's answers without ever seeing the teacher's uncertainty.
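The claim that supervised fine-tuning is equation (2) with a one-hot teacher is easy to check numerically. A minimal sketch on a single token position, with hypothetical four-token distributions; `forward_kl` is an illustrative helper, not a function from the paper.

```python
import math

def forward_kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

student = [0.70, 0.15, 0.10, 0.05]        # hypothetical student distribution
soft_teacher = [0.80, 0.12, 0.06, 0.02]   # hypothetical soft teacher distribution
hard_teacher = [1.0, 0.0, 0.0, 0.0]       # one-hot spike on the sampled token (index 0)

nll = -math.log(student[0])               # Eq 1 at this position
kl_hard = forward_kl(hard_teacher, student)
kl_soft = forward_kl(soft_teacher, student)

print(f"NLL of sampled token:      {nll:.4f}")
print(f"KL(one-hot || student):    {kl_hard:.4f}")
print(f"KL(soft teacher || student): {kl_soft:.4f}")
```

The first two numbers coincide. The third depends on every entry of the teacher's distribution, which is the signal a black-box teacher withholds.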

There is a second, quieter problem, and it bites even if you somehow had the teacher's distribution. Distillation degrades when the teacher is far more capable than the student. Cho and Hariharan showed in 2019 that a more accurate teacher can be a worse teacher: the gap in capacity is too wide for the small model to follow. GPT-4 to a 7B model is exactly that kind of gap. So you want something that bridges it, an intermediate model closer to the student's level. TAKD proposed precisely this, a "teacher assistant" in the middle, but in a white-box vision setting and with no step to make the assistant match the teacher. Hold that omission; it decides everything.

Slip an open proxy between them

Proxy-KD's move is to insert an open model, the proxy $\pi_p$, and solve both problems at once. Because the proxy is open, its full distribution is readable, so the student can do real soft-label distillation against it. And because the proxy is chosen to sit between the teacher and the student in capability, it softens the capacity gap.

One detail surprises people: the proxy is larger than the student, not smaller. In the paper the teacher is GPT-4, the proxy is Llama-2-70B, and the student is Llama-2-7B. The proxy has to be capable enough to imitate GPT-4 closely; a weak proxy imitates it badly and passes that badness on. The paper confirms this directly: a 70B proxy aligns better with the teacher, and yields a better student, than a 13B proxy. The proxy is a high-capacity stand-in for an even-higher-capacity teacher, well above the student it serves.

The pipeline runs in three stages over three slices of data. The training set is one million GPT-4 responses (from OpenOrca and Nectar, plus synthetic data on benchmark training sets), cut into 10% for warming up the proxy, 45% for aligning the proxy to the teacher, and 45% for distilling into the student. Step through the stages below: stages 1 and 2 pull the proxy onto the teacher, and only in stage 3 does the student learn, from the teacher's text and the aligned proxy's distribution at once.

Figure 2 · the Proxy-KD pipeline

Pull the proxy onto the teacher: copy its text (NLL) and prefer the teacher’s replies over the proxy’s own (DPO).

Three models: the locked GPT-4 teacher, an open Llama-2-70B proxy, and the Llama-2-7B student. Stage 1 warms the proxy on teacher text; stage 2 aligns it (NLL + DPO); stage 3 distills into the student, with hard labels from the teacher and a weighted soft distribution from the now-frozen proxy. Click through the stages to see what flows when.
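The 10% / 45% / 45% split described above works out to the following slice sizes. This is a sketch under two assumptions the text does not confirm: that "one million" is exact, and that the slices are contiguous index ranges over an already-shuffled dataset. `split_ranges` is a hypothetical helper.

```python
TOTAL_EXAMPLES = 1_000_000  # "one million GPT-4 responses", treated as exact here

# Percentages from the text: warm-up, proxy alignment, student distillation.
SPLIT_PERCENT = {"warmup": 10, "align": 45, "distill": 45}

def split_ranges(total, percents):
    """Return {name: (start, stop)} index ranges that tile [0, total)."""
    ranges, start = {}, 0
    for name, pct in percents.items():
        stop = start + total * pct // 100
        ranges[name] = (start, stop)
        start = stop
    return ranges

ranges = split_ranges(TOTAL_EXAMPLES, SPLIT_PERCENT)
for name, (start, stop) in ranges.items():
    print(f"{name:8s} {stop - start:>9,d} examples  [{start}, {stop})")
```

Under those assumptions the proxy sees 550,000 examples across warm-up and alignment, and the student sees the remaining 450,000.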

The key word is aligned. An open proxy is only useful as a teacher stand-in if its distribution actually matches what the black box would have produced. Drop in a raw Llama-2-70B and distill from it, and you are distilling from the wrong teacher. So the first half of the method is one job: make the proxy a faithful echo of GPT-4.

Align the proxy: SFT, then DPO

Alignment runs in two passes. First a warm-up and a round of plain fine-tuning on the teacher's text, the same NLL as equation (1) but applied to the proxy:

$$
\mathcal{L}_{\text{Proxy-NLL}} = \mathbb{E}_{(x,y)\sim\mathcal{D}_p}\big[\, -\log \pi_p(y\mid x) \,\big] \tag{3}
$$

Copying text gets the proxy into the right neighborhood, but it leaves a residual mismatch: trained only to maximize the likelihood of teacher answers, the proxy still puts mass on replies the teacher never would. The second pass closes that gap by comparison rather than imitation. For each prompt, sample the proxy's own reply $\hat{y}$ and pair it against the teacher's reply $y$, then train the proxy to prefer the teacher's. That is exactly the setup Direct Preference Optimization was built for: turn "$y$ is better than $\hat{y}$" into a single differentiable loss, with no reward model and no reinforcement-learning loop.

$$
\mathcal{L}_{\text{DPO}}^{(i)} = -\log\sigma\!\left[\, \beta\log\frac{\pi_p^{(i)}(y\mid x)}{\pi_p^{(i-1)}(y\mid x)} \;-\; \beta\log\frac{\pi_p^{(i)}(\hat{y}\mid x)}{\pi_p^{(i-1)}(\hat{y}\mid x)} \,\right] \tag{4}
$$

Walk through it. Each log ratio measures how much more likely the current proxy $\pi_p^{(i)}$ makes a reply than a frozen reference $\pi_p^{(i-1)}$ does. The teacher's reply $y$ is the "chosen" one and its ratio should go up; the proxy's own reply $\hat{y}$ is the "rejected" one and its ratio should go down. The difference of the two is a margin; $\beta$ sets how hard the loss leans on it, which in DPO is the strength of the leash keeping the proxy near its reference. The $-\log\sigma$ is minimized when that margin is large and positive, that is, when the proxy has moved decisively toward the teacher's reply and away from its own.

One sign here is worth pinning down, because the paper prints it the other way. Standard DPO is a negative log-sigmoid loss, $-\log\sigma[\cdot]$, and Proxy-KD minimizes it alongside the NLL. The published equation (4) drops the minus and writes $+\log\sigma[\cdot]$. Minimizing $+\log\sigma$ of the margin would push the margin the wrong way, toward preferring the proxy's own reply over the teacher's, the exact opposite of the stated goal. So the minus belongs there; we keep it, and so does any correct implementation. (You can tell from the paper's own results that the code is right: after alignment the proxy's top-1 token agreement with the teacher rises, and dropping this preference step costs the student real points.)
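The sign argument can be checked directly. Below is a minimal sketch of equation (4) for one triple $(x, y, \hat{y})$, taking sequence log-probabilities as plain floats. The log-probability values are hypothetical, `dpo_loss` is an illustrative name, and `beta=0.1` is a common DPO default rather than a value taken from the paper.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Eq 4 with the minus sign; all inputs are sequence log-probs."""
    win = lp_chosen - ref_chosen        # log ratio for the teacher's reply y
    lose = lp_rejected - ref_rejected   # log ratio for the proxy's own reply y_hat
    margin = beta * (win - lose)
    return -log_sigmoid(margin)

# Reference log-probs (hypothetical): log pi_ref(y|x) = -20, log pi_ref(y_hat|x) = -15
unchanged = dpo_loss(-20.0, -15.0, -20.0, -15.0)       # proxy == reference, margin 0
toward_teacher = dpo_loss(-12.0, -25.0, -20.0, -15.0)  # y more likely, y_hat less
toward_self = dpo_loss(-25.0, -12.0, -20.0, -15.0)     # the wrong direction

print(f"unchanged: {unchanged:.3f}")
print(f"toward teacher: {toward_teacher:.3f}")
print(f"toward own reply: {toward_self:.3f}")
```

With the minus in place, moving toward the teacher's reply lowers the loss and moving toward the proxy's own reply raises it. Flip the sign and the ordering reverses, which is why $+\log\sigma$ cannot be the quantity being minimized.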

Two design choices set this apart from textbook DPO, and both come from the iterative loop. The reference $\pi_p^{(i-1)}$ is not a fixed checkpoint but the proxy from the previous round, re-anchored in each of the $k=16$ iterations, so the proxy keeps chasing a moving target it just left behind. And the rejected reply $\hat{y}$ is drawn fresh from the current proxy each round, on-policy, rather than from a static dataset. That online sampling is the dominant cost of the method, the price of keeping the comparison honest. The proxy update is the sum of the copy term (equation 3) and the preference term (equation 4, written $\mathcal{L}_{\text{Pref}}$ in the sum):

$$
\mathcal{L}_{\text{Proxy}}^{(i)} = \mathcal{L}_{\text{Proxy-NLL}}^{(i)} + \mathcal{L}_{\text{Pref}}^{(i)}, \qquad i = 1,\dots,k \tag{6}
$$

Watch the proxy converge below. The amber outline is the teacher's next-token distribution, fixed. The violet bars are the proxy, starting with its mass in the wrong place (its own preferred reply, the rejected token) and pulled onto the teacher as the iterations advance. The agreement readout climbs but stops short of 100%: even a well-aligned proxy is not a perfect copy, and the next stage is built to handle that residual gap.

Figure 3 · aligning the proxy with iterative DPO

The proxy (violet bars) pulled onto the fixed teacher target (amber outline) over k = 16 DPO rounds. Each round prefers the teacher's reply (▲ chosen) over the proxy's own sample (▼ rejected), moving mass from one to the other. Overlap, the top-1 agreement the paper measures, rises from roughly half to about 90% and stays below 100%. Schematic trajectory; the rise and the residual gap are the paper's qualitative findings.

# proxy alignment, one DPO round (ref = the proxy from the last round)
# lp(m, y, x) = log prob of reply y given prompt x under model m
ref = freeze(copy(proxy))             # pi_p^(i-1): a frozen snapshot, not an alias
for x, y in D_p:                      # y: the teacher's reply = chosen
    y_hat = proxy.sample(x)           # the proxy's own reply = rejected (on-policy)
    nll = -lp(proxy, y, x)            # Eq 3: copy the teacher's text
    win  = lp(proxy, y, x)     - lp(ref, y, x)
    lose = lp(proxy, y_hat, x) - lp(ref, y_hat, x)
    pref = -log(sigmoid(beta * (win - lose)))   # Eq 4, with the minus
    opt.zero_grad()                   # clear gradients left from the previous step
    (nll + pref).backward(); opt.step()         # Eq 6
# repeat for k = 16 rounds, re-snapshotting ref at the start of each

After 16 rounds the proxy is a strong, GPT-4-flavored model in its own right (the aligned 70B proxy scores around 87 on ARC and 78 on GSM8K, well above any 7B student), and crucially its distribution is one you can read. The teacher's knowledge now lives in an open model. The student can finally have the soft labels equation (2) calls for.

Distill the aligned proxy

Student training combines the two sources. The hard labels stay: the student still does plain NLL on the teacher's text, equation (1) restricted to the distillation slice $\mathcal{D}_s$ (this is $\mathcal{L}_{\text{Student-NLL}}$, equation 7 in the paper's numbering). On top of that, the soft labels arrive from the aligned proxy. Because the proxy is open, equation (2) finally typechecks, with the proxy $\pi_p$ standing in for the unreachable teacher:

$$
\mathcal{L}_{\text{Student-KL}} = \mathbb{E}_{(x,y)\sim\mathcal{D}_s}\big[\, \mathbb{D}_{\text{KL}}\!\big(\pi_p(y\mid x)\,\|\,\pi_s(y\mid x)\big) \,\big] \tag{8}
$$

Same forward KL, proxy on the left, so the student is pushed to cover every token the proxy thinks plausible. In practice the proxy's distribution is computed once, offline, and only the top-10 token logits per step are stored, since almost all the probability mass sits on a handful of tokens anyway. That keeps the soft labels cheap to carry through training. The student now learns the answer (from the teacher's text) and the shape of the answer (from the proxy's curve) together, the reason for recovering the soft labels in the first place.
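One way to picture the top-10 caching is below. It is a sketch, not the paper's implementation: the 50-token vocabulary and its logits are hypothetical, `cache_top_k` and `truncated_forward_kl` are illustrative names, and renormalizing both distributions over the cached tokens is an assumption, since the text does not say how the truncated distribution is handled.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cache_top_k(logits, k=10):
    """Keep the k largest logits as (token_id, logit) pairs, largest first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def truncated_forward_kl(cached, student_logits):
    """KL(proxy || student) restricted to the cached tokens.
    Renormalizing both sides over that support is an assumption of this sketch."""
    ids = [token_id for token_id, _ in cached]
    p = softmax([logit for _, logit in cached])
    q = softmax([student_logits[i] for i in ids])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 50-token vocabulary with geometrically decaying proxy probabilities.
proxy_logits = [5.0 - 0.4 * i for i in range(50)]
student_logits = [4.0 - 0.3 * i for i in range(50)]

cached = cache_top_k(proxy_logits, k=10)
full = softmax(proxy_logits)
covered_mass = sum(full[i] for i, _ in cached)
kl = truncated_forward_kl(cached, student_logits)
print(f"mass in top-10: {covered_mass:.4f}, truncated KL: {kl:.4f}")
```

In this toy setup the ten cached tokens hold about 98% of the proxy's mass, so the truncated KL keeps most of the signal while storing 10 numbers per step instead of one per vocabulary entry.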

Trust the proxy sample by sample

The proxy is aligned but imperfect, and the imperfection is uneven: on some prompts it mirrors the teacher closely, on others it drifts. Distilling its distribution blindly would teach the student the proxy's mistakes on exactly the prompts where the proxy is least trustworthy. So Proxy-KD weights each example by how faithfully the proxy stands in for the teacher there.

The measure of faithfulness is already in hand: $\log \pi_p(y\mid x)$, the proxy's log-likelihood of the teacher's reply $y$. If the proxy would readily have produced the teacher's answer, it assigns that answer high probability, so the two agree on this example. Standardize that quantity across the distillation set (subtract the mean $\mu$, divide by the standard deviation $\gamma$) to get a z-score, then squash it through a sigmoid into a weight in $(0,1)$:

$$
w(x,y) = \sigma\!\left( \frac{\log \pi_p(y\mid x) - \mu}{\gamma} \right), \quad \mu = \mathbb{E}_{\mathcal{D}_s}\!\big[\log \pi_p(y\mid x)\big], \quad \gamma^2 = \mathrm{Var}_{\mathcal{D}_s}\!\big[\log \pi_p(y\mid x)\big] \tag{9}
$$

Note that the standardization divides by the standard deviation $\gamma$, not by the variance $\gamma^2$; equation (9) mentions $\gamma^2$ only to define $\gamma$. The sigmoid does the rest: an example where the proxy models the teacher's reply one standard deviation better than average gets $z=+1$ and weight $\sigma(1)\approx 0.73$; one standard deviation worse gets $z=-1$ and weight $\approx 0.27$; the average-likelihood example lands exactly at $0.5$. Drag the marker to read any sample's weight:

Figure 4 · the per-sample trust weight

The weight is a sigmoid of the standardized log-likelihood the proxy assigns the teacher's reply. Samples the proxy models well (right) keep near-full weight; samples it models poorly (left) are discounted toward zero. The average-likelihood sample sits at exactly 0.5. Most samples cluster near the middle (the faint hump).
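The trust weight of equation (9) takes only a few lines to compute. A minimal sketch: the five log-likelihoods are hypothetical, `trust_weights` is an illustrative name, and using the population standard deviation (rather than the sample one) is an assumption, as is the fallback when every sample has the same likelihood.

```python
import math
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def trust_weights(proxy_logps):
    """Eq 9: sigmoid of the z-scored log-likelihood the proxy gives each teacher reply."""
    mu = statistics.fmean(proxy_logps)
    gamma = statistics.pstdev(proxy_logps)  # population std; this choice is an assumption
    if gamma == 0:
        # Degenerate case (all likelihoods equal): treat every sample as average.
        return [0.5] * len(proxy_logps)
    return [sigmoid((lp - mu) / gamma) for lp in proxy_logps]

# Hypothetical sequence log-likelihoods log pi_p(y | x) for five teacher replies.
logps = [-12.0, -20.0, -28.0, -20.0, -20.0]
weights = trust_weights(logps)

for lp, w in zip(logps, weights):
    print(f"log-lik {lp:6.1f} -> weight {w:.3f}")
print(f"sigma(+1) = {sigmoid(1.0):.2f}, sigma(-1) = {sigmoid(-1.0):.2f}")
```

The reply the proxy models best gets the largest weight, the average ones get exactly 0.5, and the weights of two replies equally far above and below the mean sum to one.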

The weight multiplies only the soft term, which gives the weighted KL of equation (10):

$$
\mathcal{L}_{\text{Weight-KL}} = \mathbb{E}_{(x,y)\sim\mathcal{D}_s}\big[\, w(x,y)\,\mathbb{D}_{\text{KL}}\!\big(\pi_p(y\mid x)\,\|\,\pi_s(y\mid x)\big) \,\big] \tag{10}
$$

The hard-label NLL stays unweighted: even on a low-trust example, the teacher's text is still ground truth, so the student should keep learning it. The full student objective adds the two, with a single knob $\alpha$ setting how much the soft labels count:

$$
\mathcal{L}_{\text{Student}} = \mathcal{L}_{\text{Student-NLL}} + \alpha\,\mathcal{L}_{\text{Weight-KL}}, \qquad \alpha = 100 \tag{11}
$$

The large $\alpha = 100$ is not a thumb on the scale toward the soft labels so much as a units fix: a per-token KL over a peaked distribution is numerically tiny next to a sequence NLL, so it needs a big multiplier just to register. This weighting is a heuristic, not something the paper derives from a bound, and the ablation later shows it helps on average, though not on every benchmark. All of it assembles into one student step:

# student distillation step (proxy frozen, top-10 logits cached)
x, y = sample(D_s)                    # y: the teacher's reply (hard label)
nll = -lp(student, y, x)              # Eq 7: hard-label SFT on teacher text
# soft labels: the aligned proxy's distribution at each token of y
kl  = KL(proxy_p(y, x), student_p(y, x))     # Eq 8: forward KL(proxy || student)
z   = (lp(proxy, y, x) - mu) / gamma  # how well the proxy models teacher's y
w   = sigmoid(z)                      # Eq 9: per-sample trust, in (0, 1)
loss = nll + alpha * w * kl           # Eqs 10-11, alpha = 100
opt.zero_grad()                       # clear gradients left from the previous step
loss.backward(); opt.step()           # mu, gamma precomputed over D_s
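For readers who want something executable, here is the same objective evaluated on toy numbers, without any training loop. It is a sketch under stated assumptions: the three-token logits are hypothetical, `student_loss` is an illustrative name, and averaging both terms over token positions is a choice made here, not a detail the text specifies.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student_loss(proxy_logits, student_logits, target_ids, w, alpha=100.0):
    """Eq 11 on one sequence: NLL on teacher tokens + alpha * w * KL(proxy || student).
    Both terms are averaged over token positions (an assumption of this sketch)."""
    nll = kl = 0.0
    for p_logits, s_logits, y in zip(proxy_logits, student_logits, target_ids):
        p, q = softmax(p_logits), softmax(s_logits)
        nll += -math.log(q[y])                       # hard label: teacher's token
        kl += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    n = len(target_ids)
    return nll / n + alpha * w * kl / n

# Two token positions over a hypothetical three-token vocabulary.
proxy_logits = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.0]]
student_logits = [[2.0, 1.5, 0.0], [1.0, 1.0, 0.5]]
target_ids = [0, 1]                                  # the teacher's sampled tokens

for w in (0.0, 0.27, 0.73):
    print(f"w={w:.2f}: loss={student_loss(proxy_logits, student_logits, target_ids, w):.3f}")
```

Setting `w=0.0` recovers plain hard-label fine-tuning, and raising the weight makes the proxy's distribution count for more, which is the per-sample behavior the trust weight is meant to produce.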

It beats both kinds of KD

The comparison that matters is against the two families Proxy-KD is trying to combine: white-box KD (which has rich soft labels but only from a weaker open teacher) and black-box KD (which has the strong teacher but only its text). On a Llama-2-7B student across six reasoning and knowledge benchmarks, Proxy-KD averages 56.78, ahead of the best black-box baseline (vanilla SFT on GPT-4 text, 53.66), the best white-box baseline (MiniLLM, 53.87), and a teacher-assistant baseline (TAKD, 52.82). The lead is largest on GSM8K, where it scores 53.07, several points clear of the field; on BBH it scores 53.40, only narrowly edging the next-best methods. Switch benchmarks below and watch where the teal bar leads and where it merely ties:

Figure 5 · Proxy-KD vs the baselines

Six distillation methods into the same Llama-2-7B student, one benchmark at a time. Proxy-KD (teal, rightmost) leads on the average and the reasoning benchmarks BBH and GSM8K, and sits level with the field on MMLU and CSQA. The GPT-4 teacher ceiling is annotated for each benchmark. Click any bar for its method.
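The size of the average lead follows directly from the four averages quoted above. A small arithmetic check, using only those numbers; the dictionary labels are shorthand for this sketch.

```python
proxy_kd_avg = 56.78

# Six-benchmark averages quoted in the text for the strongest baseline of each family.
baseline_avgs = {
    "black-box SFT on GPT-4 text": 53.66,
    "MiniLLM (white-box)": 53.87,
    "TAKD (teacher assistant)": 52.82,
}

margins = {name: round(proxy_kd_avg - avg, 2) for name, avg in baseline_avgs.items()}
for name, margin in margins.items():
    print(f"Proxy-KD leads {name} by {margin:.2f} points")
```

The lead on the average is roughly three to four points over each family, with the narrowest margin against MiniLLM.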

Two caveats the figure makes easy to check. Proxy-KD does not win everywhere: on MMLU it lands at 51.35, a hair behind plain forward-KL distillation, and on CSQA the TAKD baseline edges it. And the gap to the teacher is still large: GPT-4 scores 88 on BBH, where the best student reaches the low 50s. This is distillation into a 7B model, not a way to reach GPT-4. The narrower claim holds: given the same small student and the same GPT-4 data, routing the knowledge through an aligned proxy beats every other way of using it.

Which piece is load-bearing

The ablations are where the method's logic shows, because they answer the obvious skeptic's question: does the proxy really do anything, or is it an expensive detour? Knock out one component at a time and each one leaves a mark. Remove the proxy entirely and you are back to plain black-box SFT; the student drops 6.72 on BBH and 3.56 on GSM8K, the soft-label signal gone.

The most telling bar is the one that keeps the proxy but skips its alignment. Distill from a raw, un-aligned Llama-2-70B and BBH falls 10.40, further than removing the proxy altogether. An unaligned proxy is worse than no proxy: it is a confident teacher pointing the wrong way, and the student follows it off the road. The same lesson is hidden in TAKD's number: its unaligned assistant (52.82) scores below plain black-box KD (53.66). A model in the middle helps only if it has first been made to match the teacher. Select the bars to compare:

Figure 6 · ablating one piece at a time

−align · The 70B proxy is used raw, never aligned to the teacher. On BBH it scores below using no proxy: a misaligned proxy actively misleads the student.

The full model is the teal bar and the dashed line; each amber bar removes one piece. On BBH, −align (an unaligned proxy) drops 10.40, deeper than −proxy (no proxy at all) at 6.72: a misaligned bridge is worse than none. Switch benchmarks to see which pieces matter where; the trust weight, for one, helps the average but lifts ARC when removed.

The remaining two pieces fall in line. Dropping the DPO preference step, so the proxy is aligned by copying alone, costs the student across the board (ARC −4.98, MMLU −2.56): imitation gets the proxy close, and only the preference step closes the rest. Replacing the weighted KL with a plain one costs 2.60 on AGIEval and 1.88 on MMLU, though it actually raises ARC by 0.72. A heuristic behaves exactly like that: helpful on average, not everywhere.

Each part of the method has one job, and laid end to end they read as a few plain sentences. Soft labels carry more than hard ones, but a black box only gives hard ones, so build an open proxy that gives soft ones, align it carefully (because an unaligned one misleads), distill its distribution into the student, and lean on it harder where it agrees with the teacher. Every loss term in the paper is one of those sentences. Two costs come with that: the alignment phase and its online sampling roughly double the training, and the experiments cover only Llama models, leaving Mistral and Qwen backbones for future work. Align a proxy once, though, and a sealed teacher's distribution becomes something a small open model can learn from.

Provenance · Verified against primary literature

Hinton et al. (2015): Knowledge distillation: a soft target distribution carries more than a hard label.

DPO (Rafailov et al., 2024): The preference loss used to align the proxy to the teacher.

TAKD (Mirzadeh et al., 2020): An intermediate model to bridge the teacher-student capacity gap.

Cho & Hariharan (2019): A larger teacher can distill worse: the capacity-gap effect.

Correction: The paper prints the DPO alignment loss (Eq 4) as +log σ[·], with no leading minus. Minimized alongside the NLL, that drives the preference the wrong way; the standard, correct DPO form is −log σ[·]. We teach the minus, and the paper's own results confirm the implementation uses it.

Questions you might still have

Why not just fine-tune the student on GPT-4 text directly?
That is vanilla black-box KD, the −proxy bar in the ablation. It only ever sees the one token the teacher sampled, the hard label, and never the distribution over alternatives. Recovering that distribution from an aligned open proxy is worth 6.72 points on BBH.

If GPT-4 is the teacher, why is the proxy bigger than the student?
The proxy (Llama-2-70B) is the white-box stand-in, not the teacher. It needs enough capacity to imitate GPT-4 closely; a 70B proxy aligns better and yields a better student than a 13B one. It sits between teacher and student in capability, well above the 7B student.

Is this just TAKD with an extra model in the middle?
TAKD inserts an intermediate model but never aligns it to the teacher, and here its unaligned assistant (52.82) scores below plain black-box KD (53.66). The alignment phase, SFT plus iterative DPO, turns a bridge that hurts into one that helps.

Why forward KL and not the reverse KL of MiniLLM?
Forward KL puts the proxy on the left of the divergence, so the student is penalized for missing any token the proxy supports and learns to cover its modes. MiniLLM and GKD use reverse KL, which lets the student commit to one mode. Proxy-KD chooses the covering direction.

What does the sample weight actually buy?
It down-weights examples where the proxy is a poor stand-in for the teacher, measured by the low probability the proxy assigns the teacher reply. It is a heuristic: the ablation shows it helps the average but not every benchmark, raising ARC when removed.

Footnotes & further reading

¹ The phrase "dark knowledge" for this transferable structure came from Hinton's later talks, not the 2015 paper itself, which says only that soft targets carry "much more information per training case than hard targets." The paper: Hinton, Vinyals, Dean, Distilling the Knowledge in a Neural Network.
The paper: Chen, Chen, Yi, Quan, Li, Yan, Zhang, Knowledge Distillation of Black-Box Large Language Models (Sun Yat-sen University & Alibaba, 2024).
The alignment loss is Direct Preference Optimization (Rafailov et al., arXiv:2305.18290), here run iteratively with the reference re-anchored to the previous round's proxy.
The capacity-gap effect is from Cho & Hariharan, On the Efficacy of Knowledge Distillation, and the teacher-assistant idea from Mirzadeh et al., Improved Knowledge Distillation via Teacher Assistant (TAKD).
Reverse-KL distillation baselines: MiniLLM (Gu et al., arXiv:2306.08543) and GKD (Agarwal et al., arXiv:2306.13649). Both models are trained with LoRA in the paper's experiments.
