Recent advances in large-scale pre-training have led to impressive gains in speech recognition performance. Models like Whisper from OpenAI achieve state-of-the-art results by training on massive amounts of weakly labelled audio data. However, the scale of these models makes them slow and computationally expensive to run.
In a new paper, researchers from Hugging Face demonstrate how knowledge distillation can be used to compress Whisper models without significantly impacting accuracy. Through a technique called pseudo-labeling, the authors generate training data by running audio through the original Whisper model to get predicted transcriptions. They then train a smaller “student” model to mimic the outputs of the larger “teacher” on this new dataset.
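As a rough illustration, the pseudo-labeling step can be reproduced with the Hugging Face `transformers` pipeline. This is a minimal sketch, not the paper's exact setup: the teacher checkpoint and the tiny LibriSpeech dummy dataset below are placeholder choices.

```python
# Minimal pseudo-labelling sketch: the large "teacher" Whisper model
# transcribes raw audio, and its predictions become the training targets
# for the smaller "student" model.
import torch
from datasets import load_dataset
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Teacher: an original Whisper checkpoint (placeholder choice).
teacher = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=device,
)

# Any audio corpus can be pseudo-labelled; this tiny ungated dataset is
# used purely so the example runs end to end.
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)

pseudo_labelled = []
for sample in dataset:
    audio = sample["audio"]
    prediction = teacher(
        {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]}
    )["text"]
    # The teacher's transcription, not any human label, is kept as the target.
    pseudo_labelled.append({"audio": audio, "text": prediction})
```

The resulting audio/pseudo-label pairs are then used as supervised training data for the student, which the paper additionally trains to match the teacher's predicted token distributions on top of the cross-entropy loss on the pseudo-labels.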
The distilled models, dubbed Distil-Whisper, run 5.8x faster than Whisper with 51% fewer parameters, while staying within 1% of Whisper's word error rate on out-of-distribution test data. The student models remain robust to diverse acoustic conditions and even outperform Whisper on long-form audio by making fewer hallucination errors.
Speculative decoding allows Distil-Whisper to serve as an assistant model that accelerates Whisper inference by a further 2x while guaranteeing outputs identical to those of Whisper alone. The paper demonstrates that distillation is a viable method for compressing large speech recognition models for deployment.
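Speculative decoding is exposed through `transformers`' assisted generation. The sketch below assumes the public `openai/whisper-large-v2` and `distil-whisper/distil-large-v2` checkpoints and a 16 kHz mono input waveform:

```python
# Sketch of speculative decoding: the distilled model drafts candidate
# tokens and the full Whisper model verifies them, so the transcription
# is identical to running Whisper alone, only faster.
import numpy as np
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# Placeholder input: replace with a real 16 kHz mono waveform.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt").to(device)

# Passing the distilled model as `assistant_model` switches generate()
# into assisted (speculative) decoding.
generated_ids = teacher.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because the teacher verifies every token the assistant drafts, this trades a modest amount of extra memory for lower latency without changing the transcription.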
The ability to shrink models like Whisper has important implications. It enables running these high-accuracy speech recognizers on edge devices with limited compute. Distil-Whisper also reduces the cost of deploying speech services, since fewer resources are needed. Future work could explore distilling models to even smaller sizes without compromising performance. Overall, this paper suggests that knowledge distillation will be a key technique for delivering the benefits of giant pre-trained models at scale.