distil-small.en

Maintained by: distil-whisper
Distil-Whisper Small English Model

  • Parameter Count: 166M
  • License: MIT
  • Paper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
  • Tensor Type: FP16

What is distil-small.en?

Distil-small.en is a distilled English speech recognition model and the smallest checkpoint in the Distil-Whisper family. With just 166M parameters, it runs 5.6 times faster than Whisper large-v2 while staying within about 1% WER (Word Error Rate) of that model on long-form audio, making it well suited to resource-constrained environments.

Implementation Details

The model inherits Whisper's encoder-decoder architecture, with the decoder reduced to four layers to balance speed against accuracy. It supports both short-form (< 30 seconds) and long-form audio transcription, the latter via chunked processing.

  • Supports Flash Attention 2 for enhanced GPU performance
  • Implements efficient chunked processing for long audio files
  • Compatible with multiple platforms including browser-based deployment via Transformers.js
  • Trained on 22,000 hours of diverse audio data from 9 open-source datasets
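Putting the above into practice, the snippet below is a minimal sketch of loading the model through the 🤗 Transformers `pipeline` API; the device/dtype selection and the file name `sample.mp3` are illustrative assumptions, not part of the original card.

```python
import torch
from transformers import pipeline

def build_asr_pipeline():
    """Build an automatic-speech-recognition pipeline backed by
    distil-small.en (weights are downloaded on first use)."""
    use_cuda = torch.cuda.is_available()
    return pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",
        # FP16 on GPU matches the checkpoint's tensor type; fall back to FP32 on CPU.
        torch_dtype=torch.float16 if use_cuda else torch.float32,
        device=0 if use_cuda else -1,
    )

# Usage (assumes a local audio file "sample.mp3"):
# pipe = build_asr_pipeline()
# print(pipe("sample.mp3")["text"])
```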

Core Capabilities

  • Fast transcription: 5.6x faster than Whisper large-v2
  • Efficient memory usage: only 166M parameters
  • High accuracy: within 3% WER of Whisper large-v2 on short-form audio
  • Supports both short and long-form audio processing
  • Built-in chunked algorithm for efficient long-form transcription
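The chunked algorithm mentioned above works by sliding overlapping windows across the audio so that words cut at a boundary appear whole in at least one window, then trimming the overlap when the partial transcripts are merged. A pure-Python sketch of the window arithmetic (the chunk and stride lengths are illustrative, not the exact values Transformers uses):

```python
def chunk_boundaries(total_s, chunk_s=15.0, stride_s=2.5):
    """Yield (start, end) windows, in seconds, covering total_s of audio.

    Consecutive windows overlap by 2 * stride_s; the overlapping stride
    regions are discarded when the per-chunk transcripts are stitched.
    """
    step = chunk_s - 2 * stride_s  # how far each new window advances
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end == total_s:
            break
        start += step
```

For a 40-second clip this produces windows starting at 0, 10, 20, and 30 seconds, each (except possibly the last) 15 seconds long.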

Frequently Asked Questions

Q: What makes this model unique?

The model's primary strength lies in its optimal balance between size and performance. It achieves near-original Whisper accuracy while being significantly smaller and faster, making it perfect for deployment in resource-constrained environments like mobile devices or edge computing scenarios.

Q: What are the recommended use cases?

The model is ideal for real-time transcription tasks, mobile applications, and scenarios where computational resources are limited. It's particularly well-suited for short-form audio processing and can handle long-form content through its efficient chunked processing algorithm.
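For the long-form use case, the same Transformers pipeline accepts chunking arguments directly. A hedged sketch (the `chunk_length_s` and `batch_size` values and the wrapper function are illustrative assumptions):

```python
from transformers import pipeline

def transcribe_long(path, chunk_length_s=15, batch_size=4):
    """Chunked long-form transcription with distil-small.en.

    chunk_length_s splits the audio into overlapping windows that are
    batched through the model; the pipeline stitches the partial
    transcripts back together. Values here are illustrative.
    """
    pipe = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",
    )
    return pipe(path, chunk_length_s=chunk_length_s, batch_size=batch_size)["text"]
```

Raising `batch_size` trades memory for throughput, since multiple chunks are decoded in parallel.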
