Distil-Whisper Medium.en (distil-medium.en)

Maintained by: distil-whisper

Parameter Count: 394M
License: MIT
Paper: Distil-Whisper Paper
Tensor Type: FP16

What is distil-medium.en?

Distil-medium.en is a highly optimized English speech recognition model that demonstrates the power of knowledge distillation. As a compressed version of Whisper medium.en, it achieves substantial efficiency gains while maintaining near-identical accuracy: the model is 6 times faster and 49% smaller than its teacher, yet stays within 1% WER of it on out-of-distribution evaluation sets.
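
As a quick usage sketch (assumptions: the Transformers library with PyTorch installed, and "sample.wav" as a placeholder for any English clip under 30 seconds), the model loads through the standard automatic-speech-recognition pipeline:

```python
# Minimal short-form transcription sketch with Hugging Face Transformers.
# "sample.wav" is a placeholder for any English audio clip under 30 s.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```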

Implementation Details

The model employs an encoder-decoder architecture with a distinctive distillation approach: the encoder is copied directly from the teacher model and frozen during training, while the decoder is compressed to just two layers, initialized from the first and last decoder layers of the teacher. The resulting student was trained on 22,000 hours of diverse audio from 9 open-source datasets.
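
A rough sketch of what that initialization could look like in code, assuming the Transformers Whisper classes; this illustrates the idea and is not the authors' actual training script:

```python
# Illustrative sketch of the distillation setup described above:
# frozen teacher encoder, 2-layer student decoder initialized from
# the teacher's first and last decoder layers. Not the official script.
import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

# Student shares the teacher's config except for a 2-layer decoder.
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the encoder verbatim and freeze it during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Copy shared decoder weights (embeddings, positions, final layer norm).
student.model.decoder.embed_tokens.load_state_dict(
    teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(
    teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(
    teacher.model.decoder.layer_norm.state_dict())

# Initialize the two student layers from the teacher's first and last.
student.model.decoder.layers[0].load_state_dict(
    teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(
    teacher.model.decoder.layers[-1].state_dict())
```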

  • Supports both short-form (<30s) and long-form audio transcription
  • Implements Flash Attention 2 for enhanced GPU performance (loading sketch after this list)
  • Compatible with multiple frameworks including Transformers.js and Whisper.cpp
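
A hedged loading sketch for the Flash Attention 2 path; it assumes the flash-attn package is installed and a supported GPU is available (otherwise drop the flag or fall back to attn_implementation="sdpa"):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

# Enable Flash Attention 2 at load time; requires `pip install flash-attn`
# and a supported GPU. Use attn_implementation="sdpa" as a fallback.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)
```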

Core Capabilities

  • Achieves 11.1% WER on short-form and 12.4% WER on long-form audio
  • Supports chunked processing for efficient long-form transcription (first sketch below)
  • Can be used as an assistant model for speculative decoding (second sketch below)
  • Offers multiple optimization options, including 8-bit and 4-bit quantization (third sketch below)
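
First, chunked long-form transcription: the pipeline splits long audio into overlapping windows and batches them. A sketch assuming a 15-second chunk length (the value the Distil-Whisper authors suggest for these models) and "long_audio.mp3" as a placeholder path:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,   # split long audio into 15 s windows
    batch_size=16,       # transcribe chunks in parallel
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("long_audio.mp3")["text"])
```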
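Second, speculative decoding: distil-medium.en drafts candidate tokens cheaply, and the full Whisper medium.en verifies them, so the output matches the teacher's while running faster. A sketch assuming the standard Transformers assistant_model hook:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Teacher produces the final transcription.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-medium.en", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")

# distil-medium.en drafts candidate tokens for the teacher to verify.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

print(pipe("sample.wav")["text"])
```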
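Third, quantized loading: a sketch of 8-bit weight loading through bitsandbytes, assuming the bitsandbytes and accelerate packages are installed; swap to load_in_4bit=True for 4-bit:

```python
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

# 8-bit weights via bitsandbytes; use load_in_4bit=True for 4-bit instead.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-medium.en",
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package
)
```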

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to maintain accuracy while significantly reducing computational requirements through innovative distillation techniques makes it stand out. It's particularly noteworthy for achieving 6x faster inference while keeping performance within 1% WER of the original model.

Q: What are the recommended use cases?

The model is ideal for English speech recognition tasks, particularly in scenarios requiring real-time or efficient processing. It's especially suitable for both short-form and long-form audio transcription, making it versatile for applications ranging from meeting transcription to podcast subtitling.
