NVIDIA FastConformer-Hybrid Large (Uzbek)
Property | Value |
---|---|
Parameter Count | 115M |
License | CC-BY-4.0 |
Architecture | FastConformer-Transducer CTC |
Paper | Fast Conformer Paper |
WER (Common Voice) | 16.46% |
What is stt_uz_fastconformer_hybrid_large_pc?
stt_uz_fastconformer_hybrid_large_pc is an automatic speech recognition (ASR) model for the Uzbek language. It is a hybrid model trained jointly with Transducer and CTC losses, built on the FastConformer architecture with about 115M parameters. It takes 16 kHz mono-channel audio as input and outputs transcribed text in the Uzbek alphabet, using both upper- and lower-case letters.
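Because the model expects 16 kHz mono audio, it can help to validate files before transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav` and the constants are illustrative assumptions, not part of the model or of NeMo, and the check only covers uncompressed WAV files.

```python
import wave

# Input format stated for the model: 16 kHz, single channel.
EXPECTED_RATE = 16_000
EXPECTED_CHANNELS = 1


def check_wav(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels found, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.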
Implementation Details
The model is implemented with NVIDIA's NeMo toolkit and uses FastConformer, an optimized version of the Conformer architecture with 8x depthwise-separable convolutional downsampling. It was trained on approximately 1000 hours of Uzbek speech data from multiple sources, including Mozilla Common Voice, UzbekVoice, and Google FLEURS.
- Hybrid architecture combining Transducer and CTC losses
- Trained on approximately 1000 hours of diverse Uzbek speech data
- Supports 16 kHz mono-channel audio input
- Achieves 16.46% WER on the Common Voice test set
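For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; the function name and the example sentences are assumptions for demonstration, and it is not necessarily how the reported 16.46% was computed (for instance, the text normalization applied before scoring is not stated here).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the reference prefix into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substituted word out of four.
print(word_error_rate("men bugun maktabga bordim", "men bugun maktabda bordim"))
```

Note that WER is case- and punctuation-sensitive unless both strings are normalized first, which matters for a model that emits mixed-case text.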
Core Capabilities
- Transcribes Uzbek speech to text (16.46% WER on the Common Voice test set)
- Handles various speech patterns and accents
- Supports both streaming and batch processing
- Easy integration with NeMo toolkit for inference or fine-tuning
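A minimal sketch of inference through NeMo might look like the following. It assumes `nemo_toolkit[asr]` is installed and that the checkpoint is published under the identifier `nvidia/stt_uz_fastconformer_hybrid_large_pc`; that prefix, the helper name `transcribe_files`, and the example file name are assumptions, not details confirmed by this page. The exact return type of `transcribe` differs between NeMo releases, so the result is returned unmodified.

```python
# Assumed model identifier; only the part after the slash appears in this document.
MODEL_NAME = "nvidia/stt_uz_fastconformer_hybrid_large_pc"


def transcribe_files(paths, decoder="rnnt"):
    """Transcribe 16 kHz mono audio files with the hybrid model.

    Hypothetical helper: `decoder` selects the Transducer ("rnnt") or
    CTC ("ctc") branch of the hybrid checkpoint.
    """
    if decoder not in ("rnnt", "ctc"):
        raise ValueError("decoder must be 'rnnt' or 'ctc'")

    # Imported lazily so this module can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
        model_name=MODEL_NAME
    )
    if decoder == "ctc":
        # Hybrid models decode with the Transducer branch unless switched.
        model.change_decoding_strategy(decoder_type="ctc")
    return model.transcribe(list(paths))


if __name__ == "__main__":
    # "sample_uz.wav" is a placeholder file name.
    print(transcribe_files(["sample_uz.wav"]))
```

Loading the model once and reusing it across calls would be preferable in a real service; the helper reloads it each time only to keep the example short.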
Frequently Asked Questions
Q: What makes this model unique?
The model is trained jointly with Transducer and CTC losses, so a single checkpoint provides both decoders, and it is trained specifically on Uzbek speech. With approximately 1000 hours of training data and the FastConformer architecture, it reaches 16.46% WER on the Common Voice test set.
Q: What are the recommended use cases?
The model is suited to Uzbek speech transcription tasks such as automated transcription services, voice assistants, and speech analytics applications. Input audio should be 16 kHz mono, the format the model expects.