distil-large-v3

distil-whisper

Distil-large-v3 is a 756M-parameter English speech recognition model that runs 6.3x faster than Whisper large-v3 with comparable accuracy, and is optimized for long-form transcription.

  • Parameter Count: 756M
  • Model Type: Speech Recognition
  • License: MIT
  • Paper: Distil-Whisper Paper
  • Relative Speed: 6.3x faster than Whisper large-v3

What is distil-large-v3?

Distil-large-v3 is a knowledge-distilled version of OpenAI's Whisper large-v3 model, designed specifically for English speech recognition. It achieves comparable accuracy while being significantly faster, making it ideal for production environments. The model was trained on 22,000 hours of diverse audio data from nine open-source datasets, ensuring robustness across different domains and speaking styles.

Implementation Details

The model utilizes an encoder-decoder architecture, with the encoder retained from the original Whisper model and a reduced decoder for improved efficiency. It supports both sequential and chunked long-form transcription, with specialized optimizations for 30-second context windows.

  • Optimized for both short-form and long-form audio transcription
  • Supports multiple inference backends including Flash Attention 2 and PyTorch SDPA
  • Compatible with popular frameworks like Whisper.cpp, Faster-Whisper, and Transformers.js
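The chunked long-form strategy mentioned above can be sketched in a few lines: the audio is cut into fixed 30-second windows with a small overlap (stride) so that words falling on a chunk boundary are not lost. The sketch below covers only the window arithmetic, not the model call itself; the function name and default stride are illustrative, not part of any library API.

```python
def chunk_windows(duration_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times covering `duration_s` seconds of audio,
    using `chunk_s`-second windows that overlap by `stride_s` seconds."""
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]      # short audio fits in one window
    step = chunk_s - stride_s           # each new window advances by this much
    windows, start = [], 0.0
    while start + chunk_s < duration_s:
        windows.append((start, start + chunk_s))
        start += step
    windows.append((start, duration_s))  # final, possibly shorter, window
    return windows

# 70 s of audio -> three overlapping 30 s windows
print(chunk_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window is transcribed independently (which is what makes the chunked mode parallelizable), and the overlapping regions are reconciled when the per-chunk transcripts are stitched back together.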

Core Capabilities

  • Achieves WER within 1% of Whisper large-v3
  • 6.3x faster inference speed compared to the original model
  • Robust performance across different audio domains
  • Supports speculative decoding for 2x speed improvement
  • Optimized for both CPU and GPU inference
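Speculative decoding, the source of the 2x speed-up noted above, pairs a fast draft model (such as distil-large-v3) with a larger target model (such as Whisper large-v3): the draft proposes several tokens cheaply, and the target verifies them in a single pass, keeping the agreeing prefix. Below is a minimal greedy sketch with toy "models" (plain functions mapping a token sequence to its next token); the names and the toy models are purely illustrative.

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: `draft` proposes k tokens,
    `target` verifies them, accepting the longest agreeing prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposal (in practice one batched
        #    forward pass; simulated token by token here).
        for t in proposal:
            expected = target(out)
            if t == expected:
                out.append(t)            # draft and target agree: accept
            else:
                out.append(expected)     # first disagreement: take target's token
                break
        # Worst case one target token per round, so this always terminates.
    return out[len(prompt):len(prompt) + n_tokens]
```

When the draft model agrees with the target most of the time, most rounds accept several tokens at the cost of roughly one target-model pass, which is where the speed-up comes from.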

Frequently Asked Questions

Q: What makes this model unique?

The model's key innovation is its ability to maintain Whisper's accuracy while significantly reducing computational requirements through targeted knowledge distillation and architecture optimization.
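The knowledge-distillation objective behind this can be sketched as follows: the student is trained both to match the teacher's output distribution (a KL-divergence term) and to predict the ground-truth transcript (a cross-entropy term). The toy function below works on plain probability lists; the weighting `alpha` and the function name are illustrative assumptions, not the values used to train this model.

```python
import math

def distillation_loss(student_probs, teacher_probs, true_idx, alpha=0.8):
    """Toy distillation objective: alpha * KL(teacher || student)
    plus (1 - alpha) * cross-entropy against the true label."""
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    ce = -math.log(student_probs[true_idx])
    return alpha * kl + (1 - alpha) * ce
```

A student that already matches the teacher incurs no KL penalty, so the gradient pressure shifts entirely to the hard-label term; this is what lets the smaller decoder inherit the teacher's behavior.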

Q: What are the recommended use cases?

It's ideal for production environments that require fast, accurate English speech recognition, covering both short-form and long-form transcription, and for any application where computational efficiency is crucial.
