Verified · arXiv:2212.04356 · 30 min
Speech · Scale

Robust Speech Recognition via Large-Scale Weak Supervision

680,000 hours of weakly labeled audio, and no fine-tuning.

Skip the clever self-supervised pre-training. Scrape a huge, messy pile of audio paired with its transcript, train one plain Transformer to predict the text, and it transcribes a wide range of speech out of the box.

Explaining the paper: Robust Speech Recognition via Large-Scale Weak Supervision · Radford, Kim, Xu, Brockman, McLeavey, Sutskever · OpenAI · 2022 · arXiv:2212.04356

What if the way to make speech recognition robust was not a cleverer model, but a much larger and messier pile of labels?

For most of the last decade the recipe for state-of-the-art speech recognition went like this: take a giant mountain of audio with no transcripts, because unlabeled audio is cheap and nearly unlimited. Learn good representations of it with self-supervision, the way wav2vec 2.0 does by masking parts of the signal and predicting them. Then, because those representations cannot actually produce text on their own, fine-tune the system on a specific labeled dataset to turn it into a recognizer. This worked well enough to push the error rate on the standard benchmark, LibriSpeech, below 2%.

It also had two quiet problems. The first is friction: every new deployment needs its own fine-tuning, which takes labeled data and a skilled hand. The second is more insidious. A model fine-tuned on one dataset learns that dataset, including its quirks, and a recognizer that looks superhuman on LibriSpeech can fall apart on a phone call or a noisy restaurant. The score went up. The robustness did not come with it.

Whisper, from OpenAI, takes a different bet. There is a third kind of data sitting between cheap-but-unlabeled and clean-but-tiny: real transcripts that people already attached to audio all over the internet. They are plentiful and they are genuine labels, just noisy and inconsistent. The paper calls training on them weak supervision. Collect 680,000 hours of it, train a single ordinary sequence-to-sequence model to map audio to its transcript, and evaluate without ever fine-tuning on the test set. It works, and the model that comes out is more robust than the carefully pre-trained, carefully fine-tuned systems it competes with. That robustness is the part worth explaining.

None of the pieces is exotic. The architecture is a Transformer you have seen before. The training objective is next-token prediction. The whole story is about what happens when you change what the model is trained on rather than how it is built. The pieces come in order: what weak supervision is, what the model hears, how one decoder does many jobs, and why broad training buys a robustness that narrow training cannot.

The third kind of supervision

Two axes have always been in tension when you gather speech data. One is how much of it you can get. The other is how good the labels are. You can have a small, pristine corpus of human-checked transcripts, or a giant corpus of raw audio with no transcripts at all, and historically you had to pick.

The clean corner is gold supervision. SpeechStew, a strong example, combines seven datasets into 5,140 hours of human-validated transcripts. The labels are excellent. The trouble is that hand-checked audio does not scale; 5,140 hours is tiny next to what the field wanted.

The other corner is self-supervision. wav2vec 2.0 and its successors learn from raw audio with no transcripts, and that scales beautifully: the BigSSL line of work reached roughly 1,000,000 hours of unlabeled audio. The catch is in the name. With no labels, the model learns a representation of speech but not a way to write it down. A separate fine-tuning stage, on real transcripts, has to be bolted on to make a recognizer. The decoder is the missing piece.

Whisper takes the corner nobody had really worked. Web transcripts (captions, subtitles, posted transcripts) are real text aligned to real audio, large and labeled at once, so a model trained on them learns to produce text directly, with no fine-tuning stage required. They are also inconsistent, partial, and sometimes wrong, which is the price of skipping the human checker. The paper is explicit that this is not self-supervision: there is no unsupervised pre-training at all. The supervision is real, just weak.

Figure 1 · the data plane

Whisper: 680,000 h of real (audio, transcript) pairs scraped from the web. Labeled and large, just noisy, so it transcribes zero-shot with no fine-tuning.

Two axes the field traded against: how much audio you can collect, and how good the labels are. Gold supervision is clean but tiny. Self-supervision is huge but unlabeled, so it needs fine-tuning to read out. Whisper takes the empty middle-right: 680,000 hours of real, noisy transcripts, large and labeled at once.

That single choice carries the whole paper: real labels make a zero-shot transcriber, the volume and variety of recording conditions make it general, and the cost is noise, so most of the engineering that follows goes to keeping the noise from poisoning the signal.

680,000 hours of the internet

The dataset is 680,000 hours of audio paired with transcripts. It is not all English transcription. Of those hours, 117,000 cover 96 other languages, and another 125,000 are clips in some language paired with an English translation, which become the model's speech-translation training data. The rest, about 438,000 hours, is English transcription. One dataset, three jobs.

Diversity of audio is the goal, since it is what makes the model robust. Diversity of transcript quality is not, and the raw web has plenty of bad transcripts. The worst offenders are other machines. A lot of internet captions are themselves the output of older ASR systems, and training on machine transcripts teaches a model to imitate their style, a kind of "transcript-ese," rather than learn from human labels. The paper builds heuristics to detect and drop them: existing ASR systems emit normalized text, stripped of the casing and punctuation choices a person makes almost without thinking, so an all-uppercase or all-lowercase transcript, or one that never uses commas, is a fingerprint of machine output, and it goes.
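The fingerprints described above are simple enough to sketch. The function below is a minimal illustration, not the paper's actual filter: the name `looks_machine_generated` and the 20-word cutoff for the missing-comma cue are assumptions made here for the example.

```python
def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts that look like normalized ASR output (illustrative heuristic)."""
    letters = [ch for ch in transcript if ch.isalpha()]
    if not letters:
        return False  # nothing to judge; leave the decision to other filters

    # Caseless scripts report False for both checks, so they are never flagged here.
    all_upper = all(ch.isupper() for ch in letters)
    all_lower = all(ch.islower() for ch in letters)

    # Assumption: only treat "never uses commas" as a cue on longer transcripts.
    long_without_commas = len(transcript.split()) >= 20 and "," not in transcript

    return all_upper or all_lower or long_without_commas


for text in ["HELLO AND WELCOME TO THE SHOW",
             "hello and welcome to the show",
             "Hello, and welcome to the show."]:
    print(looks_machine_generated(text), "|", text)
```

A real filter would need to be tuned per language; the point is only that casing and punctuation carry enough signal to separate human transcripts from normalized machine output.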

Two more filters matter. An audio language detector checks that the spoken language matches the transcript language; if they disagree, the pair is dropped, with one deliberate exception. If the audio is some other language but the transcript is English, the pair is kept as an X → en translation example. That exception is where the 125,000 hours of translation data come from. After that, fuzzy de-duplication trims repeats, and transcripts overlapping the evaluation sets (TED-LIUM, in particular) are removed so the zero-shot scores stay honest.
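The language-match rule and its one exception fit in a few lines. This is a sketch under assumptions: `route_pair` is a hypothetical name, and the two language codes are taken as already detected by whatever audio and text language detectors are in use.

```python
from typing import Optional


def route_pair(audio_lang: str, transcript_lang: str) -> Optional[str]:
    """Decide what an (audio, transcript) pair is used for, or None to drop it."""
    if audio_lang == transcript_lang:
        return "transcribe"   # spoken and written language agree
    if transcript_lang == "en":
        return "translate"    # the deliberate exception: X -> en translation data
    return None               # any other mismatch is dropped


print(route_pair("es", "es"))  # transcribe
print(route_pair("es", "en"))  # translate
print(route_pair("en", "es"))  # None
```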

Audio is then cut into 30-second chunks, each paired with the transcript text that falls inside it. Chunks with no speech are kept too, at a reduced rate, so the model learns to emit the no-speech token on silence, which is voice activity detection for free. Whatever survives all of this is what the model trains on, raw text and all. There is no separate normalization step that rewrites numbers or punctuation into a canonical form; the model is asked to predict the transcript as written.

What the model actually hears

The architecture is intentionally boring. The authors wanted to study what scale and weak supervision buy, so they deliberately chose an off-the-shelf encoder-decoder Transformer and changed as little as possible, to avoid confusing "our data recipe works" with "our new architecture works." There is no conformer, no novel attention. The encoder reads the audio; the decoder writes the text.

Whisper never sees a waveform. Every 30-second chunk of 16 kHz audio is turned into an 80-channel log-Mel spectrogram, a picture of how much energy sits in each of 80 frequency bands over time, computed on 25-millisecond windows stepped every 10 milliseconds. The Mel scale spaces those bands the way human hearing does, finely at low pitches and coarsely at high ones, and the log mimics how the ear perceives loudness, so the budget of 80 channels is spent where human-relevant differences actually live instead of squandered on high-frequency detail the ear barely separates. So 30 seconds becomes 3,000 time frames stacked 80 channels tall. The encoder's front end is a small two-layer convolution stem, and its second layer has a stride of 2, which halves the time axis to 1,500 frames. That sequence of 1,500 vectors is what the Transformer attends over.

Figure 2 · the input representation
The model hears an 80-channel log-Mel spectrogram, not a waveform: frequency up the side, time across, energy as brightness. A 30-second chunk is 3,000 frames at a 10 ms hop, which the encoder's stride-2 convolution halves to the 1,500 frames the Transformer reads.
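The frame counts follow from the numbers in the text by plain arithmetic. The sketch below assumes one frame per 10 ms hop (that is, the chunk is padded so the count comes out exact); the constant names are chosen here for readability.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_SECONDS = 30
WINDOW_MS = 25
HOP_MS = 10
N_MELS = 80
CONV_STRIDE = 2  # stride of the second convolution in the encoder stem

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_SECONDS    # 480,000 samples
window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # 400 samples per window
hop_samples = SAMPLE_RATE_HZ * HOP_MS // 1000         # 160 samples per hop
mel_frames = samples_per_chunk // hop_samples         # 3,000 frames
encoder_positions = mel_frames // CONV_STRIDE         # 1,500 positions

print(f"spectrogram shape: {N_MELS} x {mel_frames}")
print(f"sequence length seen by the Transformer: {encoder_positions}")
```

Each encoder position therefore covers 20 ms of audio.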

The rest is standard. Sinusoidal position embeddings are added in the encoder; the decoder uses learned positions and ties its input and output token embeddings. Text is tokenized with the same byte-level BPE vocabulary as GPT-2, refit (but kept the same size) for the multilingual models so other languages do not shatter into too many pieces. The encoder and decoder share a width and a depth, and the family scales the usual way:

Model    Layers  Width  Heads  Parameters
Tiny     4       384    6      39M
Base     6       512    8      74M
Small    12      768    12     244M
Medium   24      1024   16     769M
Large    32      1280   20     1550M

The layer count is per side, so Large is 32 encoder layers and 32 decoder layers. That is the entire model. Everything interesting that follows comes from the data and the training format, not from the boxes.

One decoder, many tasks

The design that makes one model do the work of a whole pipeline is the decoder, an audio-conditional language model: it predicts text tokens one at a time, conditioned on the encoded audio. A conventional pipeline would build separate systems for language identification, transcription, translation, and timestamping, which means four networks and four interfaces to build and serve. The paper instead folds all of them into the same token stream the decoder already predicts, with a handful of special tokens to say which job to do. A single set of weights shares whatever it learns across tasks and languages, and adding a new behavior is a matter of defining a new token sequence rather than training a new model.

The sequence always starts with a <|startoftranscript|> token. Next comes a token naming the language being spoken, one of 99, or a <|nospeech|> token if the chunk is silence. Then a task token, <|transcribe|> or <|translate|>, and a flag for whether to emit timestamps. After that the text itself, optionally with timestamp tokens (quantized to 20 ms) interleaved before and after each segment. Finally <|endoftranscript|>. A timestamp is just another token in the stream, so predicting when a word begins is the same operation as predicting the word, and there is no separate alignment model. The same interface expresses every job:

Figure 3 · the multitask interface

Transcribe Spanish, with timestamps.

Every task is the same stream of tokens the decoder predicts. Control tokens set the language and the job; timestamp tokens wrap the text when asked; the words sit in between. Silence collapses to a single <|nospeech|> token. One interface, a whole pipeline.

Written out, the target for a transcribe-with-timestamps example is just a list:

# the decoder's target is just a stream of tokens (transcribe Spanish + times)
[ <|startoftranscript|>        # begin
  <|es|>                       # the detected language  (99 language tokens)
  <|transcribe|>               # the task  (or <|translate|> for X->English)
  <|0.00|>  el  zorro  veloz  <|2.40|>   # text, wrapped in timestamp tokens
  <|endoftranscript|> ]
# silence instead? -> <|startoftranscript|> <|nospeech|> <|endoftranscript|>
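The listing above can be generated mechanically. The sketch below is an illustration under assumptions: `build_target` and `timestamp_token` are hypothetical helpers, whitespace-split words stand in for the real BPE tokens, and `<|notimestamps|>` is used as the flag that switches timestamps off.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def timestamp_token(seconds: float) -> str:
    """Snap a time to the 20 ms grid and format it as a timestamp token."""
    quantized = round(seconds / 0.02) * 0.02
    return f"<|{quantized:.2f}|>"


def build_target(language: Optional[str],
                 task: str = "transcribe",
                 segments: Optional[List[Segment]] = None,
                 timestamps: bool = True) -> List[str]:
    """Assemble the decoder target; language=None means the chunk is silence."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>"]
    if language is None:
        return tokens + ["<|nospeech|>", "<|endoftranscript|>"]
    tokens += [f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    for start, end, text in segments or []:
        words = text.split()  # stand-in for BPE tokens
        if timestamps:
            tokens += [timestamp_token(start), *words, timestamp_token(end)]
        else:
            tokens += words
    tokens.append("<|endoftranscript|>")
    return tokens


print(build_target("es", "transcribe", [(0.0, 2.4, "el zorro veloz")]))
print(build_target(None))
```

Switching `task` to `"translate"` changes one token in the prefix and nothing else about the interface.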

Two details earn their keep. The model can also be conditioned on the previous segment's transcript, fed in before the start token, so it can use longer context to resolve an ambiguous word; the loss is masked over that history so the model is graded only on what it should produce. And the translation task reuses the exact same machinery: a <|translate|> token tells the decoder to emit an English translation instead of a same-language transcript, trained on the X → en pairs the language filter deliberately kept, inside the same model that transcribes 99 languages.

Just predict the next token

With the interface fixed, training is the simplest thing in deep learning. Write $\mathbf{x}$ for the audio (its log-Mel spectrogram) and $u_1, u_2, \dots$ for the target token sequence above. The decoder is trained to predict each token from the audio and the tokens before it, and the loss is the usual cross-entropy:

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta\!\left(u_t \mid u_{<t},\, \mathbf{x}\right) \tag{1}$$

That is it. No contrastive objective, no auxiliary self-supervised loss, no separate alignment model. Every task lives inside this one sum, because every task is just a different target sequence. The language token, the task token, the timestamps, and the words are all predicted the same way.
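As a small numerical illustration of equation (1), the sketch below sums the negative log-probabilities of the correct tokens and skips positions that belong to previous-segment context, which is the loss masking mentioned earlier. The function name and the probabilities are made up for the example; a real implementation would work on logits over the whole vocabulary.

```python
import math
from typing import Sequence


def sequence_loss(token_probs: Sequence[float], loss_mask: Sequence[bool]) -> float:
    """Cross-entropy of one sequence.

    token_probs[t] is the probability the model gave to the correct token u_t,
    given the audio and the tokens before it. loss_mask[t] is False for context
    tokens that should not be graded.
    """
    if len(token_probs) != len(loss_mask):
        raise ValueError("token_probs and loss_mask must have the same length")
    return -sum(math.log(p) for p, keep in zip(token_probs, loss_mask) if keep)


# Hypothetical values: the first position is prior context and is masked out.
probs = [0.9, 0.5, 0.8, 0.25]
mask = [False, True, True, True]
print(round(sequence_loss(probs, mask), 4))  # -ln(0.5 * 0.8 * 0.25) = ln(10)
```

The language token, task token, timestamps, and words would all be entries in the same `token_probs` list, which is the sense in which every task lives inside one sum.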

Notice what is missing: nothing in this loss knows the labels are noisy. It is the same cross-entropy you would use on gold transcripts. Weak supervision is not a special objective. The noise is handled before training, by the filtering, and absorbed during it, by scale. The mechanism is gradient averaging: across millions of diverse examples, each transcript's mistakes are idiosyncratic and point in their own random directions, so summed across a batch they tend to cancel. The true mapping from sound to words, by contrast, pushes the same way in example after example over hundreds of thousands of hours, and the model fits the part the examples agree on.

The training run is deliberately short. Whisper trains for $2^{20}$ updates with a batch of 256 thirty-second segments, which is only two to three passes over the data. The optimizer is AdamW with gradient clipping and a learning rate that warms up over 2,048 steps and then decays linearly to zero. Because the run is so short and the data so large, the authors do not use data augmentation or regularization on the original models; they lean on the sheer diversity of the data to prevent overfitting. (A later Large-V2 went back for 2.5 times more epochs and added SpecAugment, stochastic depth, and BPE dropout, which is why some reported numbers improved after release.) The only special move is a brief fine-tune to stop the model from confidently guessing speakers' names, a habit it picks up because so many web transcripts label who is talking.
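The learning-rate schedule described above can be sketched as a function of the update step. The step counts come from the text; the linear shape of the warmup and the `peak_lr` parameter are assumptions for illustration, since the peak value depends on model size and is not given here.

```python
TOTAL_UPDATES = 2 ** 20   # 1,048,576 updates
WARMUP_UPDATES = 2048


def learning_rate(step: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at the final update."""
    if not 0 <= step <= TOTAL_UPDATES:
        raise ValueError("step is outside the training run")
    if step < WARMUP_UPDATES:
        return peak_lr * step / WARMUP_UPDATES
    return peak_lr * (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES)


# peak_lr=1.0 shows the schedule as a fraction of the (unspecified) peak.
for step in (0, 1024, 2048, TOTAL_UPDATES // 2, TOTAL_UPDATES):
    print(step, round(learning_rate(step, peak_lr=1.0), 4))
```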

Robustness you cannot fine-tune for

Now the payoff, and it starts with a puzzle. Back in 2015, Deep Speech 2 reported reaching human-level accuracy on LibriSpeech test-clean, with a human error rate around 5.8% and the system at 5.3%. The authors guessed there was little room left. Seven years later the best LibriSpeech error rate had fallen to 1.4%, well below that "human" number. And yet models trained on LibriSpeech still make far more errors than a human the moment you move them to a phone call or a meeting. Superhuman on the test, subhuman in the world.

The paper's explanation is that humans and machines are being graded on different skills. A person transcribing a LibriSpeech clip got no practice on LibriSpeech; their score measures out-of-distribution generalization. A model trained on LibriSpeech's own training split is being measured on in-distribution generalization: fine-tuning teaches not just how to transcribe but LibriSpeech itself, its accents, its microphones, its label style, and scoring on that same distribution rewards exactly the memorized specificity. Same test, different ability. A model that trains broadly and is never fine-tuned on the benchmark has no one dataset's quirks to memorize, so it is graded the way a human is, and the score it earns is the kind that travels to a phone call or a noisy meeting.

To make that precise the paper borrows effective robustness from Taori et al. Fit a line, across the many released LibriSpeech models, that predicts a model's out-of-distribution error from its error on the reference dataset. Ordinary models all sit on that line, and the line is well above the ideal where performance is equal everywhere. Because the axes measure error, a model has positive effective robustness when it lands below the line: lower error off-distribution than its reference score predicts. The cleanest way to see it is to pin two models to the same reference score and watch what happens elsewhere.
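A toy version of the calculation makes the definition concrete. Everything numeric below is synthetic and chosen so the arithmetic is easy to follow; it is not data from the paper. The plain least-squares fit on raw error rates is also a simplification assumed here, and the helper names are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def effective_robustness(reference_error: float, ood_error: float,
                         slope: float, intercept: float) -> float:
    """Predicted minus observed out-of-distribution error.

    Positive means the model makes fewer errors off-distribution than the
    baseline trend predicts from its reference score.
    """
    return (slope * reference_error + intercept) - ood_error


# SYNTHETIC baseline models: (reference WER, out-of-distribution WER).
baseline_reference = [2.0, 3.0, 4.0, 5.0]
baseline_ood = [20.0, 30.0, 40.0, 50.0]
slope, intercept = fit_line(baseline_reference, baseline_ood)

# SYNTHETIC candidate: same reference score as a baseline, far lower OOD error.
print(effective_robustness(3.0, 12.0, slope, intercept))
```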

That is exactly the comparison in the figure below. A supervised wav2vec 2.0 model and zero-shot Whisper are tied at 2.7 word error rate on LibriSpeech Clean. Then every other dataset pulls them apart, and not by a little:

Figure 4 · effective robustness

Average: wav2vec 29.3 -> Whisper 12.8 WER, a 55.2% relative error reduction.

One row per dataset. The grey dot is supervised wav2vec 2.0, the teal dot is zero-shot Whisper. On LibriSpeech Clean they sit on top of each other at 2.7. Everywhere else Whisper's dot is far to the left: a 55.2% average relative error reduction.

Across those datasets the supervised model averages 29.3 WER and Whisper averages 12.8, a 55.2% average relative error reduction, despite the two being indistinguishable on the reference. The effect is large enough that even the smallest Whisper, the 39M Tiny, is roughly competitive on other datasets with the best supervised LibriSpeech model, even though Tiny's own LibriSpeech score is an unremarkable 6.7. Broad training did not raise the in-distribution ceiling. It raised the floor everywhere else.
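One arithmetic detail is worth checking by hand. Relative error reduction for a single pair of scores is the drop divided by the baseline. Applied to the two averages quoted above, it gives a slightly different figure from 55.2%, which suggests (an assumption here, consistent with the phrase "average relative error reduction") that the quoted number is the mean of the per-dataset reductions rather than the reduction of the means.

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percentage of the baseline's errors that the new model removes."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


from_averages = relative_error_reduction(29.3, 12.8)
print(f"{from_averages:.1f}%")  # 56.3%, close to but not equal to the quoted 55.2%

# On the reference dataset the two models tie, so the reduction there is zero.
print(relative_error_reduction(2.7, 2.7))
```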

One caveat keeps these numbers honest, and the paper is upfront about it. Word error rate is $(S + D + I)/N$: substitutions plus deletions plus insertions, over the number of reference words, from the smallest word-level edit between the model's output and the reference.

$$\text{WER} = \frac{S + D + I}{N} \tag{2}$$

It counts every mismatch equally, so a perfectly correct transcript that writes "Mr." where the reference wrote "Mister," or splits a contraction differently, is penalized like a real error. A zero-shot model, which never saw a dataset's house style, gets hit hardest by this. Whisper therefore runs both transcripts through a text normalizer before scoring, which on some datasets drops the measured WER by up to half. That normalizer was tuned alongside Whisper, so there is a real risk it flatters the model; the authors test against an independently built normalizer and release theirs so others can check. The robustness gap survives the scrutiny, but the absolute numbers do lean on normalization.
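The effect of normalization is easy to reproduce with a word-level edit distance. The sketch below implements equation (2) directly; `toy_normalize` is a deliberately tiny stand-in written for this example and is not the paper's normalizer.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum substitutions + deletions + insertions (word-level Levenshtein)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


def toy_normalize(text: str) -> str:
    """Toy normalizer: lowercase, strip edge punctuation, expand 'mr'."""
    words = []
    for word in text.lower().split():
        word = word.strip(".,!?;:")
        if word == "mr":
            word = "mister"
        if word:
            words.append(word)
    return " ".join(words)


reference = "Mister Smith arrived late"
hypothesis = "Mr. Smith arrived late."
print(wer(reference, hypothesis))                                # 0.5
print(wer(toy_normalize(reference), toy_normalize(hypothesis)))  # 0.0
```

The transcript is correct in both cases; only the scoring changes, which is why the normalizer has to be checked independently.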

Distribution shift across datasets is one kind of robustness. Holding up as the audio itself degrades is another. When the paper adds noise, the LibriSpeech specialists start out ahead in clean audio, because that is what they were tuned for, then collapse as the noise rises. Whisper barely moves. Under natural pub noise the specialists fall behind Whisper once the signal-to-noise ratio drops below about 10 dB:

Figure 5 · robustness to noise
At 10 dB: Whisper 16% vs specialist 17% WER, so Whisper is ahead.

Error against signal-to-noise ratio. A LibriSpeech specialist starts below Whisper in clean audio and then climbs steeply; Whisper stays nearly flat. The lines cross near 10 dB. (Curve shapes are illustrative; the ~10 dB crossover under pub noise is the paper's finding.)

How close to a human does this actually get? The paper checks directly: 25 recordings, transcribed by Whisper and by several professional services. One computer-assisted service edged Whisper by 1.15 WER points, and the purely human services beat it by only a fraction of a point. So the paper claims what the numbers support: Whisper approaches human accuracy and robustness; it does not clear the bar.

What scale actually buys

If the whole bet is on data, the obvious question is whether more of it keeps helping or saturates at the quality of the noisy labels. The paper checks both the model axis and the data axis.

On model size, zero-shot performance keeps improving as the model grows across multilingual recognition, translation, and language identification. English recognition is the exception: it flattens out, which the paper reads as bumping into human-level performance rather than a failure to scale. On data size, the paper trains medium models on subsamples from 0.5% up to the full set. Every increase helps, but with diminishing returns. English error falls quickly from a few thousand hours to around 13,000, slows toward 54,000, and the final jump to the full 680,000 hours buys only about another point.

The most quotable scaling result is across languages. Plot each language's error against how many hours of it are in the training set, on log-log axes, and the points fall close to a line: a squared correlation of 0.83, with error halving for roughly every 16-fold increase in data. In symbols, error falls like a small power of the data,

$$\text{WER} \;\propto\; D^{-1/4} \qquad\Longleftrightarrow\qquad \text{WER halves for every } 16\times \text{ more hours } D \tag{3}$$

so the slope on log-log axes is $-1/4$.

Figure 6 · the multilingual scaling law
102 h predicts about 33.6% WER; 1.6k h (16x more) halves it to 16.8%.

Per-language error versus hours of that language, log-log. The fit line halves the error for every 16x more data. The outliers above the line, worse than predicted, all have unfamiliar scripts: Hebrew, Telugu, Chinese, Korean. The slope is the paper's measured result; the scatter is representative.

The languages that fall furthest above the line, worse than their data would predict, share a trait: unique scripts and distance from the Indo-European languages that dominate the training set, like Hebrew, Telugu, Chinese, and Korean. The relationship also points at the fix. Because error is so well predicted by data, the way to help a weak language is to collect more of it.
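The power law in equation (3) can be turned into a small calculator. The anchor point (102 hours at about 33.6% WER) is the readout from Figure 6; the function names are hypothetical, and the fit describes the trend across languages, so individual languages can sit well off it.

```python
def predicted_wer(hours: float, anchor_hours: float, anchor_wer: float) -> float:
    """Extrapolate WER along the D^(-1/4) trend from a known anchor point."""
    return anchor_wer * (hours / anchor_hours) ** -0.25


def hours_for_target(target_wer: float, anchor_hours: float, anchor_wer: float) -> float:
    """Invert the trend: hours needed to reach target_wer from the anchor."""
    return anchor_hours * (anchor_wer / target_wer) ** 4


print(predicted_wer(102 * 16, anchor_hours=102, anchor_wer=33.6))   # about 16.8
print(hours_for_target(8.4, anchor_hours=102, anchor_wer=33.6))     # about 26,112 hours
```

The fourth power in the inverse is what makes the remedy expensive: each further halving of error costs sixteen times the data of the one before.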

One worry about cramming 99 languages and several tasks into one model is negative transfer, where the jobs interfere and each ends up worse than a dedicated model would be. The paper finds exactly that, but only at small scale. Small joint models do underperform English-only models trained on the same compute. Large joint models flip the sign: they show positive transfer and slightly beat their English-only counterparts, even though only about 65% of the joint model's compute goes to English at all. Sharing helps, once the model is big enough to afford it.

Not every result is a win, and the paper says so. Translation does reach a new zero-shot state of the art of 29.1 BLEU on CoVoST2 X→en, which it credits to having 68,000 hours of translation data for those languages against the benchmark's 861. Language identification, by contrast, is not competitive: Whisper trails the supervised state of the art by 13.6%, partly because 20 of the benchmark's 102 languages have no training data at all and cap its accuracy. Multilingual recognition is uneven too, strong on some benchmarks and weak on others. Scale buys a lot here, but it does not buy everything evenly.

Where it still breaks

The remaining failures are revealing because they are not about hearing. Larger Whisper models steadily fix perception errors, like confusing similar-sounding words. What is left looks like the failure modes of a language model bolted to audio: getting stuck in a repeat loop, dropping the first or last few words of a chunk, or, worst, hallucinating a fluent transcript that has nothing to do with the audio. These come from the decoder being a language model, which is also what makes it fluent.

They bite hardest in long-form audio, because Whisper only ever sees 30 seconds at a time and has to be slid across a long recording, where one bad window can derail the next. The fix in the paper is not a new model but a set of decoding heuristics: beam search, plus a temperature that is bumped up only when a segment looks broken (its average log-probability is too low or its text compresses too well, a sign of repetition), plus a gate that calls a window silent only when the no-speech token and the log-probability agree.

# long-form: slide a 30s window; fall back when a segment looks broken
temp = 0.0                          # start greedy, low temperature
while audio_left:
    seg = beam_search(window, beams=5, temperature=temp)
    broken = seg.avg_logprob < -1.0 or gzip_ratio(seg.text) > 2.4
    if broken and temp < 1.0:
        temp += 0.2                 # retry hotter to break a repeat loop
        continue
    if seg.nospeech_prob > 0.6 and seg.avg_logprob < -1.0:
        seg = SILENCE               # voice-activity gate
    emit(seg)
    window = shift_to(seg.last_timestamp)   # move by predicted time
    temp = 0.0

It is worth seeing why each guard exists, because each one catches a different way the decoder fails. The root cause is the same in every case: the decoder is a language model, trained to produce fluent text, so when the audio gives it little to anchor on it will happily write something fluent and wrong rather than stop.

The first guard is the compression ratio. Run the decoded text through gzip and compare its raw size to its compressed size; ordinary speech compresses by a modest amount, but a transcript stuck in a loop, "thank you thank you thank you", is almost all repetition and compresses far more, so a gzip ratio above a threshold (the code uses 2.4) is a reliable fingerprint of a repeat loop the audio never contained.

The second guard is the average log-probability, the model's own mean confidence per token across the segment; a fluent hallucination unmoored from the audio tends to be low-confidence throughout, so an average that falls below a threshold (here −1.0) flags a window where the model is guessing rather than transcribing.

The third is not a detector but a remedy, temperature fallback: when a window trips either flag, the decoder re-runs it with the sampling temperature bumped up. Greedy decoding can wedge itself into a single repeating high-probability cycle and never leave it; adding randomness perturbs the next-token choice enough to knock the decoder out of that cycle and let it find a different, correct continuation.

The guards compound on long audio because Whisper sees only 30 seconds at a time and conditions each window on the previous segment's text, so one undetected bad window can feed its garbage forward as context and derail the next; catching it early is what stops the error from propagating.
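The two detectors and the fallback loop are small enough to run. The sketch below is an illustration, not the released decoder: `Segment`, `looks_broken`, `decode_with_fallback`, and `fake_decode` are hypothetical names, the decoder is a stub that returns canned text, and `zlib` (the DEFLATE compression that gzip wraps) stands in for gzip so that header bytes do not distort the ratio on short strings.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Segment:
    text: str
    avg_logprob: float     # mean log-probability per decoded token
    no_speech_prob: float  # probability assigned to the no-speech token


def compression_ratio(text: str) -> float:
    """Raw size over compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_broken(seg: Segment, logprob_threshold: float = -1.0,
                 ratio_threshold: float = 2.4) -> bool:
    return (seg.avg_logprob < logprob_threshold
            or compression_ratio(seg.text) > ratio_threshold)


def decode_with_fallback(decode: Callable[[float], Segment],
                         temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
                         ) -> Segment:
    """Retry the same window at rising temperature until it stops looking broken."""
    seg = decode(temperatures[0])
    for temperature in temperatures[1:]:
        if not looks_broken(seg):
            break
        seg = decode(temperature)
    return seg  # may still be broken if every temperature failed


def fake_decode(temperature: float) -> Segment:
    """Stub decoder: loops at low temperature, recovers from 0.4 upward."""
    if temperature < 0.4:
        return Segment("thank you " * 30, avg_logprob=-0.3, no_speech_prob=0.01)
    return Segment("Thank you all for coming tonight.", avg_logprob=-0.4, no_speech_prob=0.01)


print(compression_ratio("thank you " * 30) > 2.4)   # True: the loop trips the guard
print(decode_with_fallback(fake_decode).text)
```

Note what the ratio cannot catch: fluent, non-repetitive text that has nothing to do with the audio compresses like ordinary speech, which is why the log-probability check runs alongside it.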

The figure shows all three regimes in one place: at low tau the transcript falls into the repeat loop and the gzip ratio climbs past 2.4; at moderate tau a little jitter knocks the decoder out of the cycle and the ratio sits under the threshold; at high tau the ratio is still under the threshold but the tokens are fluent gibberish, the failure mode the ratio guard cannot see.

Figure 7 · temperature fallback breaks repeat loops
tau = 0.0: greedy decode wedges into a repeat loop around 8 s; the gzip ratio climbs past 2.4 and the decoder retries the same 30 s window at tau = 0.2. Paper-stated: the tau steps of 0.2 up to 1.0, the 2.4 ratio threshold. Schematic: the exact tokens, the precise step at which the ratio crosses 2.4, and the curve shape.

Schematic of the long-form decoding guard. The top panel draws the transcript over a 30 s window as a track of token chips; the bottom panel draws the gzip compression ratio of the transcript-so-far, with the 2.4 threshold marked. At tau = 0.0 a clean opening collapses into thank you thank you thank you, the ratio crosses 2.4 partway through, and the decoder retries the same 30 s window at tau = 0.2. Around tau = 0.4 to 0.6 the decoder recovers; at tau = 1.0 it hallucinates without ever tripping the gzip guard. The 2.4 threshold and the 0.2 tau steps are paper-stated; the exact tokens and the precise crossing step are schematic, drawn for teaching.

The other clear limit is low-resource languages, which Figure 6 already diagnosed. Whisper's data is heavily English, so most languages get under a thousand hours and the error rates show it. The remedies the paper lists are unglamorous and direct: collect more data for rare languages, study fine-tuning for domains where labeled data exists, add reinforcement learning to optimize the decoder for the metric you care about, and probe how much of the robustness comes from the decoder behaving like a language model.

Step back and the argument is short. Unlabeled audio scales but cannot write; clean transcripts can write but do not scale; noisy web transcripts do both, weakly. Train one plain Transformer on a huge, diverse pile of them to predict the text, fold every task into the tokens it already predicts, and grade it zero-shot. What comes out is not a higher score on the old benchmark. It is a recognizer that holds up when the benchmark changes, which is the thing the old recipe could never fine-tune its way into.

Provenance: verified against primary literature
wav2vec 2.0 / BigSSL: Self-supervised pre-training that still needs fine-tuning to read out.
Taori et al. (2020): Effective robustness, the out-of-distribution score versus what the reference predicts.
Vaswani et al. (2017): An off-the-shelf encoder-decoder Transformer, chosen on purpose.
WER (edit distance): The error metric, normalized before scoring to forgive formatting.
Correction: Whisper is often miscalled self-supervised. It uses no unsupervised pre-training; the supervision is real, if noisy, web transcripts. We keep that distinction sharp. "Zero-shot" here also means no fine-tuning on the test set, not in-context examples.

Questions you might still have

Is Whisper self-supervised, like wav2vec 2.0?
No. wav2vec learns from unlabeled audio and needs a fine-tuning stage to transcribe. Whisper trains on real (audio, transcript) pairs scraped from the web, weak supervision, and transcribes zero-shot with no fine-tuning.

What does "zero-shot" mean here?
No fine-tuning on the benchmark, not in-context examples like GPT-3. Whisper never sees any of an evaluation dataset’s training split, so the score measures broad generalization.

If Whisper only ties wav2vec on LibriSpeech, why is it better?
Effective robustness. Matched at 2.7 WER on the reference, Whisper makes about half as many errors on average across the other datasets, a 55.2% average relative error reduction. Narrow training buys in-distribution accuracy that does not carry over.

Why train one model on transcription, translation, timestamps, and language ID at once?
The decoder is a language model, so every task is a token sequence on the same interface. Small joint models suffer negative transfer, but large ones show positive transfer and a single model replaces a whole pipeline.

Does Whisper actually beat human transcribers?
It is close, not clearly past. On 25 recordings a computer-assisted service beat it by 1.15 WER points and pure-human services by a fraction of a point. The claim is "approaches human accuracy and robustness," not "exceeds."

Footnotes & further reading

  1. The paper: Radford, Kim, Xu, Brockman, McLeavey, Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision (OpenAI, 2022). Code and models.
  2. The self-supervised approach Whisper contrasts itself with: Baevski et al., wav2vec 2.0, and the million-hour scaling of it in Zhang et al., BigSSL.
  3. Effective robustness, the lens for the LibriSpeech-versus-everything-else comparison: Taori et al., Measuring Robustness to Natural Distribution Shifts in Image Classification (2020).
  4. The architecture, used off the shelf: Vaswani et al., Attention Is All You Need, with the byte-level BPE tokenizer of Sennrich et al. as reused in GPT-2.
  5. The robustness puzzle it answers begins with Amodei et al., Deep Speech 2, which first claimed human parity on LibriSpeech test-clean.