wav2vec2-mbart50-ru

Property	Value
License	Apache 2.0
Architecture	Speech Encoder-Decoder
Primary Task	Russian Speech Recognition
Author	Ivan Bondarenko

What is wav2vec2-mbart50-ru?

wav2vec2-mbart50-ru is a speech-to-text model for Russian. It pairs Wav2Vec2-Large-Ru-Golos as the encoder with mBART-large-50 as the decoder, so it produces transcripts with punctuation and capitalization rather than bare lowercase text.
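A minimal sketch of how such an encoder-decoder checkpoint might be loaded and run with the Hugging Face transformers library. The Hub identifier in MODEL_ID, the max_new_tokens value, and the expectation that the checkpoint ships the tokenizer and feature-extractor files the Auto classes look for are all assumptions, not details taken from this page. The helper name transcribe is hypothetical.

```python
MODEL_ID = "bond005/wav2vec2-mbart50-ru"  # assumed Hub id, not stated on this page
REQUIRED_SAMPLING_RATE = 16_000  # the model expects 16 kHz audio


def transcribe(waveform, sampling_rate, max_new_tokens=128):
    """Return the transcript for one mono waveform (a sequence of floats).

    max_new_tokens is an illustrative cap on the output length.
    """
    if sampling_rate != REQUIRED_SAMPLING_RATE:
        raise ValueError(
            f"expected {REQUIRED_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz"
        )

    # Heavy dependencies are imported lazily, after the cheap input check.
    import torch
    from transformers import (
        AutoFeatureExtractor,
        AutoTokenizer,
        SpeechEncoderDecoderModel,
    )

    # Assumption: the checkpoint provides both preprocessing configs.
    feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = SpeechEncoderDecoderModel.from_pretrained(MODEL_ID)
    model.eval()

    # The encoder consumes raw audio; the decoder generates text tokens.
    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        generated_ids = model.generate(
            inputs.input_values, max_new_tokens=max_new_tokens
        )
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling transcribe(samples, 16000) with a list of float samples would download the weights on first use. Audio at any other rate is rejected before the model is loaded.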

Implementation Details

The model was trained on multiple Russian speech datasets, including SberDevices Golos, Common Voice 6.0, Sova RuDevices, and Russian Librispeech. It requires 16 kHz audio input and reports Word Error Rates (WER) between 13.2% and 32.5%, depending on the test set.

Utilizes SpeechEncoderDecoderModel architecture
Trained on 5 different Russian speech datasets
Supports automatic punctuation and capitalization
Processes 16 kHz audio input
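Because the model accepts only 16 kHz input, audio recorded at another rate has to be resampled first. The sketch below illustrates the idea with plain linear interpolation in pure Python. It is a simplification chosen for readability, and the function name resample_linear is hypothetical. Linear interpolation does no anti-alias filtering, so a real pipeline would more likely rely on a band-limited resampler from an audio library.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    if src_rate == dst_rate:
        return [float(s) for s in samples]

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        if lo >= last:
            # Past the final input sample: hold the last value.
            out.append(float(samples[last]))
            continue
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


# One second of 44.1 kHz audio becomes 16000 samples.
one_second = [0.0] * 44_100
print(len(resample_linear(one_second, 44_100)))  # 16000
```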

Core Capabilities

Accurate Russian speech recognition with WER as low as 13.2%
Automatic punctuation and capitalization of the transcript
Handles varied recording conditions (crowd, farfield)
Production-ready with PyTorch implementation
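The WER figures quoted above count word-level substitutions, deletions, and insertions relative to the number of words in the reference transcript. The sketch below shows that standard calculation with a word-level edit distance. The function name word_error_rate is hypothetical, and the text is split on whitespace as-is. Whether the reported scores were computed before or after stripping punctuation and case is not stated here, and that choice changes the result for a model that emits both.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substituted and one dropped word out of four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```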

Frequently Asked Questions

Q: What makes this model unique?

What sets the model apart is the pairing of a Wav2Vec2 encoder with an mBART-50 decoder, which lets it transcribe speech and add punctuation and capitalization automatically. It was trained on diverse Russian speech datasets, which makes it more robust across different speaking conditions.

Q: What are the recommended use cases?

The model is suited to Russian speech transcription tasks that require accurate, properly formatted output. It is particularly effective for crowd-sourced audio, farfield recordings, and general speech recognition scenarios where punctuation and capitalization matter.