distil-large-v3

distil-whisper

Distil-large-v3 is a 756M-parameter English speech recognition model that runs 6.3x faster than Whisper large-v3 with comparable accuracy, and is optimized for long-form transcription.

  • Parameter Count: 756M
  • Model Type: Speech Recognition
  • License: MIT
  • Paper: Distil-Whisper Paper
  • Relative Speed: 6.3x faster than Whisper large-v3

What is distil-large-v3?

Distil-large-v3 is a knowledge-distilled version of OpenAI's Whisper large-v3 model, designed specifically for English speech recognition. It achieves comparable accuracy while being significantly faster, making it ideal for production environments. The model was trained on 22,000 hours of diverse audio data from nine open-source datasets, ensuring robustness across different domains and speaking styles.

Implementation Details

The model utilizes an encoder-decoder architecture, with the encoder retained from the original Whisper model and a reduced decoder for improved efficiency. It supports both sequential and chunked long-form transcription, with specialized optimizations for 30-second context windows.

  • Optimized for both short-form and long-form audio transcription
  • Supports multiple inference backends including Flash Attention 2 and PyTorch SDPA
  • Compatible with popular frameworks like Whisper.cpp, Faster-Whisper, and Transformers.js
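The chunked long-form strategy mentioned above can be sketched in a few lines: the audio is cut into fixed 30-second windows with a small overlap (stride) so that words falling on a chunk boundary are not lost. The sketch below covers only the window arithmetic, not the model call itself; the function name and default stride are illustrative, not part of any library API.

```python
def chunk_windows(duration_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times covering `duration_s` seconds of audio,
    using `chunk_s`-second windows that overlap by `stride_s` seconds."""
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]      # short audio fits in one window
    step = chunk_s - stride_s           # each new window advances by this much
    windows, start = [], 0.0
    while start + chunk_s < duration_s:
        windows.append((start, start + chunk_s))
        start += step
    windows.append((start, duration_s))  # final, possibly shorter, window
    return windows

# 70 s of audio -> three overlapping 30 s windows
print(chunk_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window is transcribed independently (which is what makes the chunked mode parallelizable), and the overlapping regions are reconciled when the per-chunk transcripts are stitched back together.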

Core Capabilities

  • Achieves WER within 1% of Whisper large-v3
  • 6.3x faster inference speed compared to the original model
  • Robust performance across different audio domains
  • Supports speculative decoding for 2x speed improvement
  • Optimized for both CPU and GPU inference
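Speculative decoding, the source of the 2x speed-up noted above, pairs a fast draft model (such as distil-large-v3) with a larger target model (such as Whisper large-v3): the draft proposes several tokens cheaply, and the target verifies them in a single pass, keeping the agreeing prefix. Below is a minimal greedy sketch with toy "models" (plain functions mapping a token sequence to its next token); the names and the toy models are purely illustrative.

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: `draft` proposes k tokens,
    `target` verifies them, accepting the longest agreeing prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposal (in practice one batched
        #    forward pass; simulated token by token here).
        for t in proposal:
            expected = target(out)
            if t == expected:
                out.append(t)            # draft and target agree: accept
            else:
                out.append(expected)     # first disagreement: take target's token
                break
        # Worst case one target token per round, so this always terminates.
    return out[len(prompt):len(prompt) + n_tokens]
```

When the draft model agrees with the target most of the time, most rounds accept several tokens at the cost of roughly one target-model pass, which is where the speed-up comes from.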

Frequently Asked Questions

Q: What makes this model unique?

The model's key innovation is its ability to maintain Whisper's accuracy while significantly reducing computational requirements through targeted knowledge distillation and architecture optimization.
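The knowledge-distillation objective behind this can be sketched as follows: the student is trained both to match the teacher's output distribution (a KL-divergence term) and to predict the ground-truth transcript (a cross-entropy term). The toy function below works on plain probability lists; the weighting `alpha` and the function name are illustrative assumptions, not the values used to train this model.

```python
import math

def distillation_loss(student_probs, teacher_probs, true_idx, alpha=0.8):
    """Toy distillation objective: alpha * KL(teacher || student)
    plus (1 - alpha) * cross-entropy against the true label."""
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    ce = -math.log(student_probs[true_idx])
    return alpha * kl + (1 - alpha) * ce
```

A student that already matches the teacher incurs no KL penalty, so the gradient pressure shifts entirely to the hard-label term; this is what lets the smaller decoder inherit the teacher's behavior.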

Q: What are the recommended use cases?

It's ideal for production environments that require fast, accurate English speech recognition, covering both short-form and long-form transcription, and for any application where computational efficiency is crucial.
