# Distil-Whisper Small English Model
| Property | Value |
|---|---|
| Parameter Count | 166M |
| License | MIT |
| Paper | Robust Knowledge Distillation via Large-Scale Pseudo Labelling |
| Tensor Type | FP16 |
## What is distil-small.en?
Distil-small.en is the smallest checkpoint in the Distil-Whisper family of distilled speech recognition models. With just 166M parameters, it runs roughly 5.6 times faster than Whisper large-v2 while staying within 3% WER (Word Error Rate) of it, making it well suited to resource-constrained environments.
## Implementation Details
The model utilizes an encoder-decoder architecture inherited from Whisper, with specific optimizations for speed and efficiency. It features four decoder layers optimized for balancing speed and accuracy, supporting both short-form (< 30 seconds) and long-form audio transcription with chunked processing capabilities.
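As a sketch of how the model might be loaded for short-form transcription with the Hugging Face `transformers` library (the checkpoint id is taken from this card; `sample.wav` is a placeholder file name, not an asset shipped with the model):

```python
import torch
from transformers import pipeline

model_id = "distil-whisper/distil-small.en"  # checkpoint id from this card

def build_asr_pipeline():
    """Construct a speech-recognition pipeline, using the GPU if available."""
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        # FP16 matches the tensor type listed above; fall back to FP32 on CPU.
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device=device,
    )

if __name__ == "__main__":
    asr = build_asr_pipeline()
    # "sample.wav" is a placeholder; pass any audio file under 30 seconds.
    print(asr("sample.wav")["text"])
```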
- Supports Flash Attention 2 for enhanced GPU performance
- Implements efficient chunked processing for long audio files
- Compatible with multiple platforms including browser-based deployment via Transformers.js
- Trained on 22,000 hours of diverse audio data from 9 open-source datasets
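The chunked long-form processing mentioned above can be pictured as slicing the audio into fixed-length windows that overlap by a small stride, transcribing each window independently, and merging the overlapping text. A minimal sketch of the window arithmetic (the 30 s window and 5 s stride here are illustrative assumptions, not the library's internal defaults):

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times in seconds covering the audio with
    overlapping chunks; consecutive chunks overlap by stride_s seconds."""
    step = chunk_s - stride_s
    bounds = []
    start = 0.0
    while start < total_s:
        bounds.append((start, min(start + chunk_s, total_s)))
        start += step
    return bounds
```

For a 60-second file with these defaults, this yields three overlapping windows, the last one truncated at the end of the audio.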
## Core Capabilities
- Fast transcription: 5.6x faster than Whisper large-v2
- Efficient memory usage: Only 166M parameters
- High accuracy: Within 3% WER of Whisper large-v2
- Supports both short and long-form audio processing
- Built-in chunked algorithm for efficient long-form transcription
## Frequently Asked Questions
### Q: What makes this model unique?
The model's primary strength is its balance of size and speed. It stays within 3% WER of Whisper large-v2 while being far smaller and faster, making it well suited to deployment in resource-constrained environments such as mobile devices or edge computing.
### Q: What are the recommended use cases?
The model is ideal for real-time transcription tasks, mobile applications, and scenarios where computational resources are limited. It's particularly well-suited for short-form audio processing and can handle long-form content through its efficient chunked processing algorithm.
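For long-form content, the same `transformers` pipeline can be asked to apply chunking by passing `chunk_length_s` and a `batch_size`. A sketch under assumptions: the 15-second chunk length is a commonly suggested value for Distil-Whisper checkpoints rather than a documented default of this card, and `lecture.wav` is a placeholder:

```python
import torch
from transformers import pipeline

def transcribe_long_form(audio_path: str, chunk_length_s: int = 15) -> str:
    """Transcribe audio of arbitrary length via chunked processing."""
    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device="cuda:0" if torch.cuda.is_available() else "cpu",
        chunk_length_s=chunk_length_s,  # split the audio into overlapping chunks
        batch_size=8,                   # transcribe several chunks per forward pass
    )
    return asr(audio_path)["text"]

if __name__ == "__main__":
    # "lecture.wav" is a placeholder for any long-form audio file.
    print(transcribe_long_form("lecture.wav"))
```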