# NVIDIA FastConformer-Hybrid Large (Uzbek)
| Property | Value |
|---|---|
| Parameter Count | 115M |
| License | CC-BY-4.0 |
| Architecture | FastConformer Hybrid Transducer-CTC |
| Paper | Fast Conformer Paper |
| WER (Common Voice) | 16.46% |
## What is stt_uz_fastconformer_hybrid_large_pc?
This is a state-of-the-art automatic speech recognition (ASR) model designed specifically for the Uzbek language. It is built on NVIDIA's FastConformer architecture and combines Transducer and CTC approaches for robust speech recognition. The model was trained on a diverse dataset of 1000 hours of Uzbek speech, including the Common Voice, UzbekVoice, and FLEURS datasets.
## Implementation Details
The model uses an optimized FastConformer architecture with 8x depthwise-separable convolutional downsampling. It is implemented in the NVIDIA NeMo toolkit and accepts 16 kHz mono-channel audio as input. The hybrid approach combines a Transducer loss (primary) with a CTC loss during training for improved performance; a minimal inference sketch follows the feature list below.
- 115M trainable parameters
- Supports both Transducer and CTC inference modes
- Processes 16 kHz mono-channel WAV files
- Trained on 1000 hours of diverse Uzbek speech data
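As a quick start, here is a minimal inference sketch, assuming `nemo_toolkit[asr]` is installed and the checkpoint resolves under the name shown; `sample.wav` is a placeholder for any 16 kHz mono recording:

```python
# Minimal inference sketch. Assumptions: nemo_toolkit[asr] is installed and
# the pretrained name below resolves; "sample.wav" is a placeholder file.
import nemo.collections.asr as nemo_asr

# Load the hybrid Transducer-CTC checkpoint.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_uz_fastconformer_hybrid_large_pc"
)

# Transcribe one or more 16 kHz mono WAV files. By default the Transducer
# branch decodes, returning text with punctuation and capitalization. The
# exact return shape (strings vs. hypothesis objects) varies by NeMo version.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)
```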
## Core Capabilities
- Transcribes Uzbek speech with high accuracy (16.46% WER on Common Voice)
- Handles both upper- and lower-case Uzbek letters
- Outputs spaces and punctuation, including commas, question marks, and dashes
- Integrates easily with the NeMo toolkit for inference or fine-tuning (see the decoder-switch sketch after this list)
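Because both decoders share one encoder, the inference branch can be switched at runtime. A short sketch, reusing the `asr_model` object loaded above:

```python
# Switch to the CTC branch for faster, non-autoregressive decoding.
asr_model.change_decoding_strategy(decoder_type="ctc")
ctc_text = asr_model.transcribe(["sample.wav"])

# Switch back to the Transducer (RNNT) branch, which typically yields
# slightly lower WER at the cost of autoregressive decoding.
asr_model.change_decoding_strategy(decoder_type="rnnt")
rnnt_text = asr_model.transcribe(["sample.wav"])
```

In practice, the CTC branch is the cheaper option for batch workloads, while the Transducer branch is the accuracy-oriented default.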
## Frequently Asked Questions
Q: What makes this model unique?
This model combines the FastConformer architecture with a hybrid Transducer-CTC approach, specifically optimized for the Uzbek language. The large-scale training on 1000 hours of diverse Uzbek speech data makes it particularly robust for real-world applications.
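Conceptually, the hybrid objective blends the two losses over the shared encoder. The following is a schematic sketch, not the model's actual training code; the weight value is illustrative:

```python
# Schematic of the hybrid objective: a shared encoder feeds both a Transducer
# (RNNT) head and a CTC head, and their losses are blended during training.
# The 0.3 weight is illustrative; the real value comes from the NeMo config.
CTC_LOSS_WEIGHT = 0.3

def hybrid_loss(rnnt_loss: float, ctc_loss: float) -> float:
    # The Transducer stays the primary objective; the auxiliary CTC term
    # regularizes the encoder and enables a standalone CTC decoder.
    return (1 - CTC_LOSS_WEIGHT) * rnnt_loss + CTC_LOSS_WEIGHT * ctc_loss
```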
Q: What are the recommended use cases?
The model is ideal for Uzbek speech transcription tasks in various domains. It's particularly suitable for applications requiring high accuracy in general speech recognition, though performance might vary with technical terms or specialized vocabulary not present in the training data.
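For specialized vocabulary, fine-tuning on in-domain data is the usual remedy. Below is a hedged sketch using NeMo data manifests and PyTorch Lightning; all paths and hyperparameters are placeholders:

```python
# Hedged fine-tuning sketch: manifest paths and hyperparameters are
# placeholders, not values used for the released checkpoint.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_uz_fastconformer_hybrid_large_pc"
)

# NeMo expects JSON-lines manifests with audio_filepath, duration, and text.
asr_model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_validation_data({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
trainer.fit(asr_model)
```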