asr-whisper-large-v2-commonvoice-mn

speechbrain

A Whisper Large-v2 model fine-tuned with SpeechBrain for Mongolian ASR. The pretrained encoder is kept frozen, the decoder is fine-tuned, and the model reaches 64.92% WER on the CommonVoice test set.

Property	Value
License	Apache 2.0
Framework	PyTorch / SpeechBrain
Test WER	64.92%
Test CER	25.73%
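
To make the two metrics in the table concrete, here is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed from an edit distance. The helper names and the toy Mongolian sentence pair are illustrative assumptions, not the evaluation code or data behind the reported figures, and scoring conventions (for example whether spaces count as characters) vary between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Total edit errors divided by total reference length, in percent."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        # Split into words for WER, into characters (spaces included) for CER.
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total


# Toy example (hypothetical data): one character is missing in the last word.
refs = ["сайн байна уу"]
hyps = ["сайн байна у"]

print(f"WER: {error_rate(refs, hyps, 'word'):.2f}%")  # WER: 33.33%
print(f"CER: {error_rate(refs, hyps, 'char'):.2f}%")  # CER: 7.69%
```

A single wrong character makes the whole word count as an error, which is one reason a CER can sit well below the WER, as it does in the table above.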

What is asr-whisper-large-v2-commonvoice-mn?

This is an automatic speech recognition (ASR) model for the Mongolian language, built on OpenAI's Whisper Large-v2 architecture and fine-tuned on the CommonVoice dataset. It is part of the effort to extend ASR support to less-common languages.

Implementation Details

The pretrained Whisper-large-v2 encoder is kept frozen while the decoder is fine-tuned for Mongolian speech recognition. The model uses the original Whisper tokenizer and expects single-channel audio sampled at 16 kHz.

Frozen pretrained Whisper-large-v2 encoder
Fine-tuned decoder architecture
Integrated Whisper tokenizer
Automatic audio normalization capabilities
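
The frozen-encoder, fine-tuned-decoder setup can be sketched in plain PyTorch as follows. This is only an illustration of the general technique: `TinySeq2Seq` is a hypothetical stand-in with two linear layers, not Whisper, and the layer sizes, learning rate, and loss are assumptions rather than the recipe used to train this model.

```python
import torch
from torch import nn


class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in with an encoder and a decoder (not Whisper)."""

    def __init__(self, n_feats=80, hidden=32, vocab=100):
        super().__init__()
        self.encoder = nn.Linear(n_feats, hidden)
        self.decoder = nn.Linear(hidden, vocab)

    def forward(self, feats):
        return self.decoder(self.encoder(feats))


model = TinySeq2Seq()

# Freeze the encoder: its parameters receive no gradients and are not updated.
for p in model.encoder.parameters():
    p.requires_grad = False
model.encoder.eval()

# Hand only the still-trainable (decoder) parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # illustrative value

# One training step on random data to show which weights move.
feats = torch.randn(4, 80)
targets = torch.randint(0, 100, (4,))
encoder_before = model.encoder.weight.clone()
decoder_before = model.decoder.weight.clone()

loss = nn.functional.cross_entropy(model(feats), targets)
loss.backward()
optimizer.step()

print("encoder unchanged:", torch.equal(model.encoder.weight, encoder_before))
print("decoder updated:", not torch.equal(model.decoder.weight, decoder_before))
```

Because the encoder parameters never enter the optimizer, the pretrained representations stay intact and only the decoder adapts to the new language.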

Core Capabilities

Mongolian speech recognition with 64.92% WER on the CommonVoice test set
Automatic audio preprocessing and normalization
GPU-compatible inference
Support for 16kHz audio processing
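
As a rough picture of what audio normalization involves, the sketch below downmixes multi-channel samples to mono and resamples them to 16 kHz. It assumes that "normalization" here means channel averaging plus resampling; the function names are hypothetical, and the linear interpolation is a deliberately crude stand-in (no anti-aliasing filter) for the proper resamplers a real pipeline would use.

```python
def to_mono(frames):
    """Average the channels of each frame (a tuple with one value per channel)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a mono signal by linear interpolation (no anti-aliasing)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of constant stereo audio at 48 kHz (toy data).
stereo = [(1.0, 3.0)] * 48000
mono_16k = resample_linear(to_mono(stereo), src_rate=48000)
print(len(mono_16k))  # 16000
```

Since the model card lists normalization as automatic, a step like this should only be needed when preparing audio outside the model's own loading path.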

Frequently Asked Questions

Q: What makes this model unique?

The model targets Mongolian ASR specifically. Freezing the Whisper Large-v2 encoder preserves its pretrained representations, so only the decoder has to be adapted to the target language, which reduces the number of trainable parameters during fine-tuning.

Q: What are the recommended use cases?

The model is designed for Mongolian speech recognition and suits applications that need transcripts of Mongolian audio. It fits best where 16 kHz audio input is available and a GPU can be used for inference.
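
For GPU inference, a minimal sketch using SpeechBrain's `WhisperASR` interface might look like the following. The hub identifier is assumed from the model name and publisher shown at the top of this page, the `savedir` path and the `transcribe` helper are hypothetical, and the import path is the one used in SpeechBrain 1.0 and later (older releases exposed the class under `speechbrain.pretrained`). The return type of `transcribe_file` has differed between SpeechBrain releases, so inspect the result rather than assuming a plain string.

```python
def transcribe(path, device="cuda"):
    """Transcribe one audio file with the fine-tuned Whisper model (sketch)."""
    # Imported lazily so that defining this helper does not require SpeechBrain.
    from speechbrain.inference.ASR import WhisperASR

    asr_model = WhisperASR.from_hparams(
        # Assumed hub identifier, derived from the model name and publisher.
        source="speechbrain/asr-whisper-large-v2-commonvoice-mn",
        # Hypothetical local cache directory for the downloaded checkpoint.
        savedir="pretrained_models/asr-whisper-large-v2-commonvoice-mn",
        # Pass device="cpu" if no GPU is available.
        run_opts={"device": device},
    )
    return asr_model.transcribe_file(path)
```

Calling `transcribe("recording.wav")` (a hypothetical file name) downloads the checkpoint on first use, so the first call can take considerably longer than later ones.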