Distil-Whisper Large-v2
| Property | Value |
|---|---|
| Parameter Count | 756M |
| Model Type | Speech Recognition |
| License | MIT |
| Paper | Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling |
| Tensor Type | FP16 |
What is distil-large-v2?
Distil-large-v2 is a distilled version of OpenAI's Whisper large-v2 that keeps near-identical accuracy at a fraction of the cost: through knowledge distillation it delivers 6x faster inference while being 49% smaller than the original model. It is designed specifically for English speech recognition and performs within 1% WER (Word Error Rate) of its teacher model.
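As an illustration of typical short-form usage, here is a minimal sketch built on the Hugging Face Transformers ASR pipeline; the checkpoint id `distil-whisper/distil-large-v2` and the audio path are assumptions for this example:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"  # assumed Hub checkpoint id

# Load the distilled model and its processor (tokenizer + feature extractor)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Wrap everything in an ASR pipeline for short (<30 s) clips
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample.wav")  # placeholder path to a local audio file
print(result["text"])
```

On a GPU the model runs in FP16, matching the tensor type listed in the table above.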
Implementation Details
The model employs an encoder-decoder architecture where the encoder is inherited directly from Whisper and remains frozen during training. The key innovation lies in the decoder, which is reduced to just two layers initialized from the first and last decoder layers of the teacher model. This architectural optimization, combined with training on 22,000 hours of diverse audio data, enables both efficiency and robustness.
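The two-layer decoder can be confirmed directly from the model configuration; a quick sketch using Transformers (the checkpoint id is an assumption, and the printed values are what the architecture described above implies):

```python
from transformers import AutoConfig

# Inspect the distilled architecture: full Whisper encoder, two-layer decoder
config = AutoConfig.from_pretrained("distil-whisper/distil-large-v2")  # assumed checkpoint id
print("encoder layers:", config.encoder_layers)  # expected: same depth as Whisper large-v2
print("decoder layers:", config.decoder_layers)  # expected: 2
```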
- 6x faster inference compared to Whisper large-v2
- 49% reduction in model size (756M parameters)
- Supports both short-form and long-form audio transcription
- Optimized for batch processing and streaming inference (see the chunked, batched sketch after this list)
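A sketch of chunked, batched long-form transcription with the Transformers pipeline; the chunk length, batch size, checkpoint id, and file path are illustrative assumptions rather than fixed requirements:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Long-form audio (>30 s) is split into fixed-length chunks that are
# transcribed in parallel batches and stitched back together.
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed checkpoint id
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
    chunk_length_s=15,  # assumed chunk length; tune for your audio and hardware
    batch_size=16,      # assumed batch size; controls parallel chunk decoding
)

result = pipe("long_recording.wav")  # placeholder path to audio longer than 30 s
print(result["text"])
```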
Core Capabilities
- High-accuracy English speech recognition
- Efficient processing of both short (<30s) and long-form audio
- Support for Flash Attention 2 and BetterTransformer optimizations (see the sketch after this list)
- Compatible with multiple frameworks including Transformers.js and Whisper.cpp
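Both attention optimizations are applied at load time; a sketch assuming the flash-attn package (for Flash Attention 2) or Optimum (for BetterTransformer) is installed:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v2"  # assumed checkpoint id

# Option 1: Flash Attention 2 (requires the flash-attn package and a supported GPU)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)

# Option 2: BetterTransformer via Optimum, for hardware without Flash Attention 2
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, use_safetensors=True
)
model = model.to_bettertransformer()
```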
Frequently Asked Questions
Q: What makes this model unique?
Its value lies in combining near-identical accuracy to Whisper large-v2 with substantially faster inference: the teacher's encoder is kept in full, the decoder is reduced to just two layers, and large-scale training on pseudo-labelled audio keeps accuracy within 1% WER of the teacher.
Q: What are the recommended use cases?
The model is ideal for English speech recognition tasks requiring both accuracy and speed, particularly in production environments where computational efficiency is crucial. It excels in both short-form and long-form audio transcription scenarios.