Distil-Whisper Large-v2

Maintained by: distil-whisper

Property         Value
---------------  --------------------------------------------------------------
Parameter Count  756M
Model Type       Speech Recognition
License          MIT
Paper            Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Tensor Type      FP16

What is distil-large-v2?

Distil-large-v2 is a distilled version of OpenAI's Whisper large-v2, designed specifically for English speech recognition. Trained with large-scale knowledge distillation, it delivers roughly 6x faster inference and is 49% smaller than the original model, while performing within 1% WER (Word Error Rate) of its teacher.
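
As a concrete starting point, here is a minimal short-form transcription sketch using the Hugging Face Transformers pipeline (the checkpoint id is the published Hub repo; "audio.mp3" is a placeholder path):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

# Load the model in half precision on GPU (FP32 on CPU).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# "audio.mp3" is a placeholder for any local audio file.
print(pipe("audio.mp3")["text"])
```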

Implementation Details

The model employs an encoder-decoder architecture in which the encoder is inherited directly from Whisper and kept frozen during training. The key change is in the decoder, which is reduced to just two layers initialized from the first and last decoder layers of the teacher model. This architectural reduction, combined with training on 22,000 hours of pseudo-labelled audio from diverse open-source datasets, yields both efficiency and robustness.
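
The two-layer decoder is visible directly in the checkpoint's configuration; a quick sanity check (using the published Hub checkpoint id):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("distil-whisper/distil-large-v2")
print(config.encoder_layers)  # 32 -- the full Whisper large-v2 encoder, frozen during training
print(config.decoder_layers)  # 2  -- initialized from the teacher's first and last decoder layers
```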

  • 6x faster inference compared to Whisper large-v2
  • 49% reduction in model size (756M parameters)
  • Supports both short-form and long-form audio transcription
  • Optimized for batch processing and streaming inference (see the long-form sketch below)
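
For long-form audio, the Transformers pipeline can chunk the input and transcribe the chunks in parallel. A sketch, assuming the 15-second chunk length recommended for Distil-Whisper (the "long_audio.mp3" path and batch size are illustrative):

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# chunk_length_s splits long audio into windows that are transcribed in
# parallel; batch_size controls how many windows run at once.
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=15,  # 15 s chunks are recommended for Distil-Whisper
    batch_size=16,      # illustrative; tune to available memory
)

# "long_audio.mp3" is a placeholder for any long recording.
print(pipe("long_audio.mp3")["text"])
```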

Core Capabilities

  • High-accuracy English speech recognition
  • Efficient processing of both short (<30s) and long-form audio
  • Support for Flash Attention 2 and BetterTransformer optimizations (see the sketch after this list)
  • Compatible with multiple frameworks including Transformers.js and Whisper.cpp
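
A sketch of enabling the attention optimizations, assuming a compatible GPU and the flash-attn package (with BetterTransformer via Optimum as a fallback):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v2"

# Flash Attention 2: requires a supported GPU and `pip install flash-attn`.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# Fallback on hardware without Flash Attention support: BetterTransformer
# via Optimum (`pip install optimum`).
# model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16)
# model = model.to_bettertransformer()
```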

Frequently Asked Questions

Q: What makes this model unique?

Its value lies in matching Whisper large-v2's accuracy to within 1% WER while delivering a substantial speed-up, achieved through knowledge distillation and a heavily reduced two-layer decoder.

Q: What are the recommended use cases?

The model is ideal for English speech recognition tasks requiring both accuracy and speed, particularly in production environments where computational efficiency is crucial. It excels in both short-form and long-form audio transcription scenarios.
