wav2vec2-base-vietnamese-250h
| Property | Value |
|---|---|
| Parameters | 95M |
| License | CC BY-NC 4.0 |
| Author | nguyenvulebinh |
| Training Data | 13k hours pre-training, 250 hours fine-tuning |
What is wav2vec2-base-vietnamese-250h?
wav2vec2-base-vietnamese-250h is a state-of-the-art Vietnamese speech recognition model based on Facebook's wav2vec 2.0 architecture. The model was pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset. When combined with a 4-gram language model, it achieves Word Error Rates (WER) of 6.15% on the VIVOS dataset and 11.52% on Common Voice VI.
Implementation Details
The model utilizes the wav2vec 2.0 architecture and Connectionist Temporal Classification (CTC) for fine-tuning. It processes 16kHz sampled speech audio and functions as an acoustic model that can be enhanced with an optional 4-gram language model for improved accuracy.
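CTC decoding works by collapsing repeated frame-level predictions and then removing blank tokens. A minimal sketch of greedy CTC decoding (the tiny character vocabulary here is hypothetical, just to illustrate the collapse rule):

```python
def ctc_greedy_decode(frame_ids, blank_id=0, id_to_char=None):
    """Apply the CTC collapse rule: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    if id_to_char is not None:
        return "".join(id_to_char[i] for i in out)
    return out

# Hypothetical vocabulary: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 3], 0, {1: "c", 2: "a", 3: "t"}))
# → cat
```

The blank token is what lets the model emit genuinely repeated characters: a repeated letter must be separated by a blank frame, otherwise the repeats collapse into one.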
- Pre-trained on 13k hours of unlabeled Vietnamese audio
- Fine-tuned on 250 hours of labeled VLSP ASR data
- Supports audio input sampled at 16kHz
- Optimized for audio segments under 10 seconds
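Because the model expects 16 kHz mono input and works best on segments under 10 seconds, audio at other sample rates needs resampling and long recordings need splitting. A minimal sketch using NumPy linear interpolation (a real pipeline would more likely use torchaudio or librosa for resampling):

```python
import numpy as np

def resample_to_16k(audio, orig_sr, target_sr=16000):
    """Linearly interpolate a mono waveform to the target sample rate."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)

def split_into_chunks(audio, sr=16000, max_seconds=10):
    """Cut a waveform into segments no longer than max_seconds."""
    step = sr * max_seconds
    return [audio[i:i + step] for i in range(0, len(audio), step)]
```

Linear interpolation is a rough approximation (it does not low-pass filter before downsampling), so for production use a proper polyphase resampler is preferable; the chunking logic carries over unchanged.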
Core Capabilities
- Achieves 6.15% WER on the VIVOS dataset with the 4-gram language model
- Achieves 11.52% WER on Common Voice VI
- Supports end-to-end speech recognition
- Can be used with or without the 4-gram language model
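WER, the metric behind the figures above, counts word-level substitutions, insertions, and deletions against a reference transcript, normalized by the reference length. A minimal sketch of how such scores are computed (Levenshtein distance over word tokens):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Evaluation toolkits such as jiwer apply the same algorithm, usually after normalizing case and punctuation, which matters for reproducing published numbers exactly.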
Frequently Asked Questions
Q: What makes this model unique?
This model is the first Vietnamese speech recognition system that achieves state-of-the-art results using wav2vec 2.0's self-supervised learning approach, demonstrating that learning from raw audio alone can outperform traditional semi-supervised methods.
Q: What are the recommended use cases?
The model is well suited to Vietnamese speech recognition tasks, particularly for audio segments under 10 seconds. It fits applications requiring high-accuracy transcription, though the CC BY-NC 4.0 license restricts it to non-commercial use.