whisper-large-v3-russian

Property	Value
Parameter Count	1.54B
Model Type	Automatic Speech Recognition
Tensor Type	BF16
Language	Russian

What is whisper-large-v3-russian?

whisper-large-v3-russian is a Russian speech recognition model fine-tuned from OpenAI's Whisper Large V3. Fine-tuning reduced the Word Error Rate (WER) on the Common Voice 17.0 dataset from 9.84 to 6.39. Training ran for over 60 hours on two NVIDIA A100 80GB GPUs.
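For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The sketch below illustrates that standard definition and works out the relative improvement implied by the two reported scores. It is not the evaluation script used for this model, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a score dropped, relative to its starting value."""
    return (before - after) / before


# One substituted word out of three reference words.
print(word_error_rate("привет как дела", "привет так дела"))

# Relative WER reduction implied by the reported scores (9.84 -> 6.39).
print(f"{relative_reduction(9.84, 6.39):.1%}")
```

With the reported scores, the relative reduction works out to roughly 35%.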

Implementation Details

The model uses the Whisper architecture and was fine-tuned on the Russian portion of Common Voice 17.0, which contains over 237,000 rows split 95/5 into training and test sets (225,761/11,883 rows). The weights are stored in BF16 precision, and the model runs on CPU, CUDA, and MPS devices.
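Choosing among the CPU, CUDA, and MPS backends at runtime might look like the sketch below. It assumes PyTorch is the runtime; the helper names `pick_device` and `detect_device` are hypothetical, and the preference order (CUDA, then MPS, then CPU) is a common convention rather than something this model requires.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; requires `torch` to be installed."""
    import torch  # imported lazily so pick_device() stays usable without it

    return pick_device(
        cuda_available=torch.cuda.is_available(),
        mps_available=torch.backends.mps.is_available(),
    )


# Example: a machine with no CUDA GPU but with MPS support.
print(pick_device(cuda_available=False, mps_available=True))
```

Keeping the decision in a pure function makes it easy to test without a GPU; `detect_device()` is the only part that touches PyTorch.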

Built on the Whisper Large V3 architecture with 1.54B parameters
Optimized for Russian language processing
Supports audio chunking with 30-second segments
Includes timestamp generation capabilities
Compatible with Flash Attention 2 on supported GPUs
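The features listed above map onto options of the Hugging Face Transformers `pipeline` API, as the following sketch suggests. Several things here are assumptions rather than facts from this page: the repository id in `MODEL_ID`, the default batch size of 16, and the helper names `build_pipeline_kwargs` and `transcribe`. Flash Attention 2 additionally requires a compatible GPU and the `flash-attn` package.

```python
# Assumed repository id; it is not stated in this document.
MODEL_ID = "antony66/whisper-large-v3-russian"


def build_pipeline_kwargs(device: str, use_flash_attention: bool = False,
                          batch_size: int = 16) -> dict:
    """Collect keyword arguments for transformers.pipeline()."""
    kwargs = {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        "device": device,
        "chunk_length_s": 30,       # split long audio into 30-second chunks
        "batch_size": batch_size,   # number of chunks decoded together (illustrative value)
        "return_timestamps": True,  # emit segment-level timestamps
    }
    if use_flash_attention:
        # Only valid on supported GPUs with the flash-attn package installed.
        kwargs["model_kwargs"] = {"attn_implementation": "flash_attention_2"}
    return kwargs


def transcribe(audio_path: str, device: str = "cpu") -> dict:
    """Load the model and transcribe one file; needs torch and transformers."""
    import torch
    from transformers import pipeline

    asr = pipeline(torch_dtype=torch.bfloat16, **build_pipeline_kwargs(device))
    return asr(audio_path, generate_kwargs={"language": "russian"})
```

Calling `transcribe("call.wav", device="cuda:0")` (with a hypothetical file name) would download the weights on first use and return a dictionary containing the transcript text and timestamped chunks.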

Core Capabilities

High-accuracy Russian speech recognition
Optimized for phone call transcription
Batch processing support with customizable chunk sizes
Flexible deployment options across different computing platforms
Benefits from audio preprocessing before recognition, especially for telephone audio

Frequently Asked Questions

Q: What makes this model unique?

The model is fine-tuned specifically for Russian. On Common Voice 17.0 it reaches a WER of 6.39, compared with 9.84 for the base Whisper Large V3 model.

Q: What are the recommended use cases?

The model is well suited to phone call transcription, general Russian speech recognition, and other applications that need high-accuracy transcripts. Preprocessing the audio before transcription is recommended, especially for telephone recordings.
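This page does not say which preprocessing steps are meant. As an assumption, the sketch below shows two common ones for telephone recordings: downmixing stereo to mono and peak normalization of 16-bit PCM samples. The function names and the default target level are hypothetical, and resampling (Whisper models expect 16 kHz input) is left to the audio loader.

```python
def downmix_to_mono(left: list[int], right: list[int]) -> list[int]:
    """Average two 16-bit PCM channels into one (rounding down)."""
    return [(l + r) // 2 for l, r in zip(left, right)]


def peak_normalize(samples: list[int], target_peak: float = 0.9) -> list[int]:
    """Scale 16-bit PCM samples so the loudest one sits near target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    scale = target_peak * 32767 / peak
    # Clamp to the valid 16-bit range in case rounding overshoots.
    return [max(-32768, min(32767, round(s * scale))) for s in samples]


# Example: a quiet two-channel recording, mixed down and then normalized.
mono = downmix_to_mono([100, -200, 50], [120, -180, 30])
print(peak_normalize(mono))
```

Dedicated audio tools typically offer more robust versions of these steps; the point here is only to show the kind of cleanup that can precede transcription.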