# wav2vec2-base-vi
| Property | Value |
|---|---|
| Parameter Count | 95M |
| License | CC-BY-NC-4.0 |
| Training Data | 13k hours of Vietnamese YouTube audio |
| Architecture | Wav2Vec2 Base |
## What is wav2vec2-base-vi?
wav2vec2-base-vi is a self-supervised speech model designed specifically for Vietnamese speech recognition. Developed by nguyenvulebinh, it is pretrained on 13,000 hours of Vietnamese YouTube audio spanning clean and noisy recordings, conversational speech, and multiple genders and dialects. The model employs the wav2vec2 architecture, which has proven highly effective for speech processing tasks.
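In wav2vec2-style pretraining, the model learns by predicting the quantized representation of a masked time step against a set of distractors sampled from other steps, via a contrastive (InfoNCE) objective. A minimal sketch of that loss for a single masked frame (function name and shapes are illustrative, not the library's API):

```python
import numpy as np

def info_nce(pred: np.ndarray, target: np.ndarray,
             distractors: np.ndarray, temp: float = 0.1) -> float:
    """Contrastive loss for one masked frame.

    pred: context-network output at the masked step, shape (D,)
    target: the true quantized latent at that step, shape (D,)
    distractors: negatives sampled from other steps, shape (K, D)
    """
    candidates = np.vstack([target[None, :], distractors])
    # Cosine similarity between the prediction and each candidate
    sims = candidates @ pred / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(pred) + 1e-9)
    logits = sims / temp
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))  # true target sits at index 0
```

When the prediction matches the true latent and not the distractors, the loss approaches zero; pretraining minimizes this over many masked steps, which is what lets the model learn from unlabeled audio.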
## Implementation Details
The model was trained for 35 epochs on a TPU v3-8, following the same wav2vec2 architecture as its English counterpart. It integrates directly with the Transformers library and can be fine-tuned for downstream speech recognition tasks.
- Transformer-based architecture optimized for Vietnamese speech
- Trained on diverse audio sources ensuring robust performance
- Compatible with Hugging Face's Transformers library
- Supports both base (95M params) and large (317M params) versions
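As an example of the Transformers integration mentioned above, the sketch below extracts frame-level features from the pretrained encoder. The Hub repo id `nguyenvulebinh/wav2vec2-base-vi`, the 16 kHz mono input format, and the helper names are assumptions; verify them against the actual model card before use.

```python
import numpy as np

def to_mono_float32(samples: np.ndarray) -> np.ndarray:
    """Downmix multi-channel audio and peak-normalize to float32 in [-1, 1]."""
    if samples.ndim == 2:
        samples = samples.mean(axis=1)  # average channels to mono
    samples = samples.astype(np.float32)
    peak = float(np.abs(samples).max())
    return samples / peak if peak > 0 else samples

def extract_features(speech: np.ndarray,
                     repo_id: str = "nguyenvulebinh/wav2vec2-base-vi"):
    """Run the pretrained encoder, returning frame-level hidden states.

    Assumes 16 kHz mono float32 input and a Hub checkpoint at repo_id.
    """
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained(repo_id)
    model = Wav2Vec2Model.from_pretrained(repo_id)
    inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # Shape (1, frames, 768) for the base model
        return model(inputs.input_values).last_hidden_state
```

Usage would look like `features = extract_features(to_mono_float32(waveform))`; for transcription rather than feature extraction, a CTC head must be fine-tuned on top (or a fine-tuned checkpoint loaded via `Wav2Vec2ForCTC`).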
## Core Capabilities
- Self-supervised learning for speech recognition
- Achieves 8.66% WER without a language model and 6.53% with a 5-gram LM on the VLSP 2020 dataset
- Supports both inference with and without language model integration
- Handles various Vietnamese dialects and audio conditions
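Decoding without a language model is typically a greedy argmax over the CTC output: take the most likely token per frame, merge consecutive repeats, and drop blanks. A minimal pure-Python sketch of that collapse step (the token strings are illustrative; wav2vec2-style CTC vocabularies commonly use `<pad>` as the blank and `|` as the word delimiter):

```python
def ctc_greedy_collapse(frame_tokens: list[str], blank: str = "<pad>") -> str:
    """Collapse per-frame CTC labels: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    # "|" marks word boundaries in wav2vec2-style vocabularies
    return "".join(out).replace("|", " ").strip()

# Per-frame argmax labels from the acoustic model (illustrative)
frames = ["x", "x", "<pad>", "i", "n", "<pad>", "|", "c", "h", "à", "o"]
print(ctc_greedy_collapse(frames))  # -> "xin chào"
```

Decoding with an n-gram LM instead rescores candidate sequences during a beam search (e.g. with a tool such as pyctcdecode), which is where the WER improvement reported above comes from.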
## Frequently Asked Questions
Q: What makes this model unique?
The model is trained on one of the largest Vietnamese audio datasets assembled for speech modeling, making it among the most extensively trained Vietnamese speech models available. While the architecture is a standard wav2vec2 Base, the diverse Vietnamese training data adapts its representations to the language, and it remains fully compatible with standard wav2vec2 implementations.
Q: What are the recommended use cases?
The model is ideal for Vietnamese speech recognition tasks, particularly in applications requiring transcription of YouTube content, conversational audio, or mixed-condition speech. It can be used both with and without a language model, depending on accuracy requirements.