wav2vec2-large-xlsr-53-th
| Property | Value |
|---|---|
| License | cc-by-sa-4.0 |
| Downloads | 109,995 |
| Language | Thai |
| Framework | PyTorch |
What is wav2vec2-large-xlsr-53-th?
This is a Thai automatic speech recognition model obtained by fine-tuning the pretrained wav2vec2-large-xlsr-53 checkpoint on the Thai portion of the Common Voice 7.0 dataset. It reports a Word Error Rate (WER) of 0.95% when words are segmented with PyThaiNLP, significantly outperforming traditional Kaldi-based approaches and remaining competitive with the speech recognition services of major cloud providers.
Implementation Details
The model was trained on 86,586 training samples using a single V100 GPU. Key hyperparameters were attention dropout (0.1), hidden dropout (0.1), and mask time probability (0.05). Training used gradient checkpointing and a frozen feature extractor to reduce memory use and the number of trainable parameters.
- Pre-processes audio by resampling to 16 kHz
- Uses CTC loss with mean reduction
- Utilizes PyThaiNLP for word tokenization
- Supports both word-level and syllable-level tokenization
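The hyperparameters listed above could be applied with the Hugging Face `transformers` library roughly as follows. This is a minimal sketch, not the authors' training script: the base checkpoint ID `facebook/wav2vec2-large-xlsr-53`, the helper name `build_model`, and the `vocab_size` and `pad_token_id` arguments are assumptions for illustration.

```python
# Values taken from the description above; the keys are the corresponding
# Wav2Vec2Config attribute names in Hugging Face transformers.
FINETUNE_OVERRIDES = {
    "attention_dropout": 0.1,
    "hidden_dropout": 0.1,
    "mask_time_prob": 0.05,
    "ctc_loss_reduction": "mean",
}


def build_model(vocab_size: int, pad_token_id: int,
                base_checkpoint: str = "facebook/wav2vec2-large-xlsr-53"):
    """Hypothetical helper: load the base checkpoint with the overrides applied.

    The checkpoint ID is an assumption; vocab_size and pad_token_id must match
    the Thai character vocabulary used by the tokenizer.
    """
    # Imported lazily so the configuration above can be inspected without
    # transformers installed.
    from transformers import Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        base_checkpoint,
        vocab_size=vocab_size,
        pad_token_id=pad_token_id,
        **FINETUNE_OVERRIDES,
    )
    # Keep the convolutional feature encoder fixed during fine-tuning.
    # (Older transformers releases name this freeze_feature_extractor.)
    model.freeze_feature_encoder()
    # Trade extra compute for lower activation memory on a single GPU.
    model.gradient_checkpointing_enable()
    return model
```

Calling `build_model` downloads the base checkpoint, so it needs network access and enough memory for a large model.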
Core Capabilities
- Reports 0.95% WER with PyThaiNLP tokenization
- Reports a Character Error Rate (CER) of 2.81%
- Handles continuous Thai speech recognition
- Supports real-time transcription
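For readers unfamiliar with the metrics above, WER and CER are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below illustrates the calculation in plain Python; it is not the evaluation code behind the reported numbers. The function names are hypothetical, the default whitespace tokenizer is a placeholder (Thai text has no spaces between words, so a segmenter such as PyThaiNLP's `word_tokenize` would be passed instead), and ignoring spaces in CER is an assumption.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str],
               tokenize: Callable[[str], Sequence]) -> float:
    """Corpus-level error rate in percent: total edits / total reference tokens."""
    total_errors = 0
    total_tokens = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref)
        total_errors += edit_distance(ref_tokens, tokenize(hyp))
        total_tokens += len(ref_tokens)
    if total_tokens == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * total_errors / total_tokens


def wer(refs: List[str], hyps: List[str],
        tokenize: Callable[[str], Sequence] = str.split) -> float:
    """Word Error Rate; pass a Thai word segmenter as `tokenize` for Thai text."""
    return error_rate(refs, hyps, tokenize)


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character Error Rate; spaces are ignored (an assumption of this sketch)."""
    return error_rate(refs, hyps, lambda s: list(s.replace(" ", "")))
```

Because the WER depends on how the text is segmented into words, figures obtained with different tokenizers are not directly comparable.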
Frequently Asked Questions
Q: What makes this model unique?
According to its reported results, this model is among the more accurate Thai speech recognition systems available, outperforming traditional Kaldi-based approaches and matching or exceeding commercial solutions. It has been used with more than one Thai tokenization method (PyThaiNLP, deepcut), which makes it adaptable to different Thai language processing pipelines.
Q: What are the recommended use cases?
The model is ideal for Thai speech transcription tasks, particularly in applications requiring high accuracy. It can be used for automated subtitling, voice command systems, and general speech-to-text applications for Thai language content.
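For a transcription task like those above, inference with the Hugging Face `transformers` and `torchaudio` libraries might look like the following. This is a minimal sketch under stated assumptions: the Hub ID `airesearch/wav2vec2-large-xlsr-53-th` and the helper name `transcribe` are not given in this document, and greedy CTC decoding is used for simplicity.

```python
# Assumption: the model is published on the Hugging Face Hub under this ID.
MODEL_ID = "airesearch/wav2vec2-large-xlsr-53-th"
# The model expects 16 kHz input audio.
TARGET_SAMPLE_RATE = 16_000


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Hypothetical helper: transcribe one audio file with greedy CTC decoding."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # torchaudio returns a (channels, samples) tensor and the file's sample rate.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(),
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: take the most likely token per frame; the processor
    # collapses repeats and removes CTC blank tokens.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

For subtitling or batch jobs, loading the processor and model once outside the function and reusing them across files would avoid repeated initialization.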