wav2vec2-base-vietnamese-250h
| Property | Value |
|---|---|
| Parameters | 95M |
| License | CC BY-NC 4.0 |
| Author | nguyenvulebinh |
| Training Data | 13k hours pre-training, 250 hours fine-tuning |
What is wav2vec2-base-vietnamese-250h?
wav2vec2-base-vietnamese-250h is a state-of-the-art Vietnamese speech recognition model based on Facebook's wav2vec 2.0 architecture. The model was pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset. When combined with a 4-gram language model, it achieves Word Error Rates (WER) of 6.15% on the VIVOS dataset and 11.52% on Common Voice VI.
Implementation Details
The model utilizes the wav2vec 2.0 architecture and Connectionist Temporal Classification (CTC) for fine-tuning. It processes 16kHz sampled speech audio and functions as an acoustic model that can be enhanced with an optional 4-gram language model for improved accuracy.
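CTC decoding works by collapsing repeated frame-level predictions and then removing blank tokens. A minimal sketch of greedy CTC decoding (the tiny character vocabulary here is hypothetical, just to illustrate the collapse rule):

```python
def ctc_greedy_decode(frame_ids, blank_id=0, id_to_char=None):
    """Apply the CTC collapse rule: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    if id_to_char is not None:
        return "".join(id_to_char[i] for i in out)
    return out

# Hypothetical vocabulary: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 3], 0, {1: "c", 2: "a", 3: "t"}))
# → cat
```

The blank token is what lets the model emit genuinely repeated characters: a repeated letter must be separated by a blank frame, otherwise the repeats collapse into one.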
- Pre-trained on 13k hours of unlabeled Vietnamese audio
- Fine-tuned on 250 hours of labeled VLSP ASR data
- Supports audio input sampled at 16kHz
- Optimized for audio segments under 10 seconds
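Because the model expects 16 kHz mono input and works best on segments under 10 seconds, audio at other sample rates needs resampling and long recordings need splitting. A minimal sketch using NumPy linear interpolation (a real pipeline would more likely use torchaudio or librosa for resampling):

```python
import numpy as np

def resample_to_16k(audio, orig_sr, target_sr=16000):
    """Linearly interpolate a mono waveform to the target sample rate."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)

def split_into_chunks(audio, sr=16000, max_seconds=10):
    """Cut a waveform into segments no longer than max_seconds."""
    step = sr * max_seconds
    return [audio[i:i + step] for i in range(0, len(audio), step)]
```

Linear interpolation is a rough approximation (it does not low-pass filter before downsampling), so for production use a proper polyphase resampler is preferable; the chunking logic carries over unchanged.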
Core Capabilities
- Achieves 6.15% WER on the VIVOS dataset with the 4-gram language model
- Achieves 11.52% WER on Common Voice VI
- Supports end-to-end speech recognition
- Can be used with or without the 4-gram language model
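WER, the metric behind the figures above, counts word-level substitutions, insertions, and deletions against a reference transcript, normalized by the reference length. A minimal sketch of how such scores are computed (Levenshtein distance over word tokens):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Evaluation toolkits such as jiwer apply the same algorithm, usually after normalizing case and punctuation, which matters for reproducing published numbers exactly.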
Frequently Asked Questions
Q: What makes this model unique?
This model is the first Vietnamese speech recognition system that achieves state-of-the-art results using wav2vec 2.0's self-supervised learning approach, demonstrating that learning from raw audio alone can outperform traditional semi-supervised methods.
Q: What are the recommended use cases?
The model is well suited to Vietnamese speech recognition tasks, particularly for audio segments under 10 seconds. It fits applications requiring high-accuracy transcription, though the CC BY-NC 4.0 license restricts it to non-commercial use.