Distil-Whisper Large-v2
| Property | Value |
|---|---|
| Parameter Count | 756M |
| Model Type | Speech Recognition |
| License | MIT |
| Paper | Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling |
| Tensor Type | FP16 |
What is distil-large-v2?
Distil-large-v2 is a distilled version of OpenAI's Whisper large-v2 that keeps near-identical accuracy at a fraction of the cost: through knowledge distillation it delivers 6x faster inference while being 49% smaller than the original model. It is designed specifically for English speech recognition and performs within 1% WER (Word Error Rate) of its teacher model.
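As an illustration of typical short-form usage, here is a minimal sketch built on the Hugging Face Transformers ASR pipeline; the checkpoint id `distil-whisper/distil-large-v2` and the audio path are assumptions for this example:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"  # assumed Hub checkpoint id

# Load the distilled model and its processor (tokenizer + feature extractor)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Wrap everything in an ASR pipeline for short (<30 s) clips
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample.wav")  # placeholder path to a local audio file
print(result["text"])
```

On a GPU the model runs in FP16, matching the tensor type listed in the table above.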
Implementation Details
The model employs an encoder-decoder architecture where the encoder is inherited directly from Whisper and remains frozen during training. The key innovation lies in the decoder, which is reduced to just two layers initialized from the first and last decoder layers of the teacher model. This architectural optimization, combined with training on 22,000 hours of diverse audio data, enables both efficiency and robustness.
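The two-layer decoder can be confirmed directly from the model configuration; a quick sketch using Transformers (the checkpoint id is an assumption, and the printed values are what the architecture described above implies):

```python
from transformers import AutoConfig

# Inspect the distilled architecture: full Whisper encoder, two-layer decoder
config = AutoConfig.from_pretrained("distil-whisper/distil-large-v2")  # assumed checkpoint id
print("encoder layers:", config.encoder_layers)  # expected: same depth as Whisper large-v2
print("decoder layers:", config.decoder_layers)  # expected: 2
```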
- 6x faster inference compared to Whisper large-v2
- 49% reduction in model size (756M parameters)
- Supports both short-form and long-form audio transcription
- Optimized for batch processing and streaming inference (see the chunked, batched sketch after this list)
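A sketch of chunked, batched long-form transcription with the Transformers pipeline; the chunk length, batch size, checkpoint id, and file path are illustrative assumptions rather than fixed requirements:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Long-form audio (>30 s) is split into fixed-length chunks that are
# transcribed in parallel batches and stitched back together.
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed checkpoint id
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
    chunk_length_s=15,  # assumed chunk length; tune for your audio and hardware
    batch_size=16,      # assumed batch size; controls parallel chunk decoding
)

result = pipe("long_recording.wav")  # placeholder path to audio longer than 30 s
print(result["text"])
```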
Core Capabilities
- High-accuracy English speech recognition
- Efficient processing of both short (<30s) and long-form audio
- Support for Flash Attention 2 and BetterTransformer optimizations (see the sketch after this list)
- Compatible with multiple frameworks including Transformers.js and Whisper.cpp
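Both attention optimizations are applied at load time; a sketch assuming the flash-attn package (for Flash Attention 2) or Optimum (for BetterTransformer) is installed:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v2"  # assumed checkpoint id

# Option 1: Flash Attention 2 (requires the flash-attn package and a supported GPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)

# Option 2: BetterTransformer via Optimum, for hardware without Flash Attention 2
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
)
model = model.to_bettertransformer()
```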
Frequently Asked Questions
Q: What makes this model unique?
Its value lies in combining near-identical accuracy to Whisper large-v2 with substantially faster inference: the teacher's encoder is kept in full, the decoder is reduced to just two layers, and large-scale training on pseudo-labelled audio keeps accuracy within 1% WER of the teacher.
Q: What are the recommended use cases?
The model is ideal for English speech recognition tasks requiring both accuracy and speed, particularly in production environments where computational efficiency is crucial. It excels in both short-form and long-form audio transcription scenarios.