# Distil-Whisper Small English Model
| Property | Value |
|---|---|
| Parameter Count | 166M |
| License | MIT |
| Paper | Robust Knowledge Distillation via Large-Scale Pseudo Labelling |
| Tensor Type | FP16 |
## What is distil-small.en?
Distil-small.en is the smallest checkpoint in the Distil-Whisper family of distilled speech recognition models. With just 166M parameters, it runs roughly 5.6 times faster than Whisper large-v2 while staying within 3% WER (Word Error Rate) of it, making it well suited to resource-constrained environments.
## Implementation Details
The model utilizes an encoder-decoder architecture inherited from Whisper, with specific optimizations for speed and efficiency. It features four decoder layers optimized for balancing speed and accuracy, supporting both short-form (< 30 seconds) and long-form audio transcription with chunked processing capabilities.
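As a sketch of how the model might be loaded for short-form transcription with the Hugging Face `transformers` library (the checkpoint id is taken from this card; `sample.wav` is a placeholder file name, not an asset shipped with the model):

```python
import torch
from transformers import pipeline

model_id = "distil-whisper/distil-small.en"  # checkpoint id from this card

def build_asr_pipeline():
    """Construct a speech-recognition pipeline, using the GPU if available."""
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        # FP16 matches the tensor type listed above; fall back to FP32 on CPU.
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device=device,
    )

if __name__ == "__main__":
    asr = build_asr_pipeline()
    # "sample.wav" is a placeholder; pass any audio file under 30 seconds.
    print(asr("sample.wav")["text"])
```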
- Supports Flash Attention 2 for enhanced GPU performance
- Implements efficient chunked processing for long audio files
- Compatible with multiple platforms including browser-based deployment via Transformers.js
- Trained on 22,000 hours of diverse audio data from 9 open-source datasets
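The chunked long-form processing mentioned above can be pictured as slicing the audio into fixed-length windows that overlap by a small stride, transcribing each window independently, and merging the overlapping text. A minimal sketch of the window arithmetic (the 30 s window and 5 s stride here are illustrative assumptions, not the library's internal defaults):

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times in seconds covering the audio with
    overlapping chunks; consecutive chunks overlap by stride_s seconds."""
    step = chunk_s - stride_s
    bounds = []
    start = 0.0
    while start < total_s:
        bounds.append((start, min(start + chunk_s, total_s)))
        start += step
    return bounds
```

For a 60-second file with these defaults, this yields three overlapping windows, the last one truncated at the end of the audio.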
## Core Capabilities
- Fast transcription: 5.6x faster than Whisper large-v2
- Efficient memory usage: Only 166M parameters
- High accuracy: Within 3% WER of Whisper large-v2
- Supports both short and long-form audio processing
- Built-in chunked algorithm for efficient long-form transcription
## Frequently Asked Questions
### Q: What makes this model unique?
The model's primary strength is its balance of size and speed. It stays within 3% WER of Whisper large-v2 while being far smaller and faster, making it well suited to deployment in resource-constrained environments such as mobile devices or edge computing.
### Q: What are the recommended use cases?
The model is ideal for real-time transcription tasks, mobile applications, and scenarios where computational resources are limited. It's particularly well-suited for short-form audio processing and can handle long-form content through its efficient chunked processing algorithm.
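For long-form content, the same `transformers` pipeline can be asked to apply chunking by passing `chunk_length_s` and a `batch_size`. A sketch under assumptions: the 15-second chunk length is a commonly suggested value for Distil-Whisper checkpoints rather than a documented default of this card, and `lecture.wav` is a placeholder:

```python
import torch
from transformers import pipeline

def transcribe_long_form(audio_path: str, chunk_length_s: int = 15) -> str:
    """Transcribe audio of arbitrary length via chunked processing."""
    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device="cuda:0" if torch.cuda.is_available() else "cpu",
        chunk_length_s=chunk_length_s,  # split the audio into overlapping chunks
        batch_size=8,                   # transcribe several chunks per forward pass
    )
    return asr(audio_path)["text"]

if __name__ == "__main__":
    # "lecture.wav" is a placeholder for any long-form audio file.
    print(transcribe_long_form("lecture.wav"))
```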