Distil-Whisper Medium.en (distil-medium.en)

Maintained by: distil-whisper

Parameter Count: 394M
License: MIT
Paper: Distil-Whisper Paper
Tensor Type: FP16

What is distil-medium.en?

Distil-medium.en is a highly optimized English speech recognition model that demonstrates the power of knowledge distillation. As a compressed version of Whisper medium.en, it achieves substantial efficiency gains while maintaining near-identical accuracy: the model is 6 times faster and 49% smaller than its teacher, yet stays within 1% WER of it on out-of-distribution evaluation sets.
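
As a quick usage sketch (assumptions: the Transformers library with PyTorch installed, and "sample.wav" as a placeholder for any English clip under 30 seconds), the model loads through the standard automatic-speech-recognition pipeline:

```python
# Minimal short-form transcription sketch with Hugging Face Transformers.
# "sample.wav" is a placeholder for any English audio clip under 30 s.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```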

Implementation Details

The model employs an encoder-decoder architecture with a distinctive distillation approach: the encoder is copied directly from the teacher model and frozen during training, while the decoder is compressed to just two layers, initialized from the first and last decoder layers of the teacher. The resulting student was trained on 22,000 hours of diverse audio from 9 open-source datasets.
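
A rough sketch of what that initialization could look like in code, assuming the Transformers Whisper classes; this illustrates the idea and is not the authors' actual training script:

```python
# Illustrative sketch of the distillation setup described above:
# frozen teacher encoder, 2-layer student decoder initialized from
# the teacher's first and last decoder layers. Not the official script.
import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

# Student shares the teacher's config except for a 2-layer decoder.
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the encoder verbatim and freeze it during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Copy shared decoder weights (embeddings, positions, final layer norm).
student.model.decoder.embed_tokens.load_state_dict(
    teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(
    teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(
    teacher.model.decoder.layer_norm.state_dict())

# Initialize the two student layers from the teacher's first and last.
student.model.decoder.layers[0].load_state_dict(
    teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(
    teacher.model.decoder.layers[-1].state_dict())
```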

  • Supports both short-form (<30s) and long-form audio transcription
  • Implements Flash Attention 2 for enhanced GPU performance (loading sketch after this list)
  • Compatible with multiple frameworks including Transformers.js and Whisper.cpp
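
A hedged loading sketch for the Flash Attention 2 path; it assumes the flash-attn package is installed and a supported GPU is available (otherwise drop the flag or fall back to attn_implementation="sdpa"):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

# Enable Flash Attention 2 at load time; requires `pip install flash-attn`
# and a supported GPU. Use attn_implementation="sdpa" as a fallback.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)
```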

Core Capabilities

  • Achieves 11.1% WER on short-form and 12.4% WER on long-form audio
  • Supports chunked processing for efficient long-form transcription (first sketch below)
  • Can be used as an assistant model for speculative decoding (second sketch below)
  • Offers multiple optimization options, including 8-bit and 4-bit quantization (third sketch below)
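
First, chunked long-form transcription: the pipeline splits long audio into overlapping windows and batches them. A sketch assuming a 15-second chunk length (the value the Distil-Whisper authors suggest for these models) and "long_audio.mp3" as a placeholder path:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,   # split long audio into 15 s windows
    batch_size=16,       # transcribe chunks in parallel
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("long_audio.mp3")["text"])
```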
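Second, speculative decoding: distil-medium.en drafts candidate tokens cheaply, and the full Whisper medium.en verifies them, so the output matches the teacher's while running faster. A sketch assuming the standard Transformers assistant_model hook:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Teacher produces the final transcription.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-medium.en", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")

# distil-medium.en drafts candidate tokens for the teacher to verify.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```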
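Third, quantized loading: a sketch of 8-bit weight loading through bitsandbytes, assuming the bitsandbytes and accelerate packages are installed; swap to load_in_4bit=True for 4-bit:

```python
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

# 8-bit weights via bitsandbytes; use load_in_4bit=True for 4-bit instead.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en",
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package
)
```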

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to maintain accuracy while significantly reducing computational requirements through innovative distillation techniques makes it stand out. It's particularly noteworthy for achieving 6x faster inference while keeping performance within 1% WER of the original model.

Q: What are the recommended use cases?

The model is ideal for English speech recognition tasks, particularly in scenarios requiring real-time or efficient processing. It's especially suitable for both short-form and long-form audio transcription, making it versatile for applications ranging from meeting transcription to podcast subtitling.
