wav2vec2-large-xlsr-53-th

Property	Value
License	cc-by-sa-4.0
Downloads	109,995
Language	Thai
Framework	PyTorch

What is wav2vec2-large-xlsr-53-th?

This is a Thai speech recognition model obtained by fine-tuning wav2vec2-large-xlsr-53 on the Thai portion of the Common Voice 7.0 dataset. It reports a Word Error Rate (WER) of 0.95% with PyThaiNLP tokenization, outperforming traditional Kaldi-based approaches and competing with major cloud providers' speech recognition services.

Implementation Details

The model was trained on 86,586 training samples using a single V100 GPU. Key hyperparameters were attention dropout (0.1), hidden dropout (0.1), and mask time probability (0.05). Training used gradient checkpointing to reduce memory use and kept the feature extractor frozen.
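
The settings above can be collected in one place, as in the minimal sketch below. The three model_config keys follow Hugging Face Wav2Vec2Config argument names; that the original training script used these exact names is an assumption. The training_flags names and the describe helper are hypothetical and only for illustration.

```python
# Hypothetical summary of the fine-tuning settings described above.
# "model_config" keys mirror Wav2Vec2Config argument names (assumed to be
# what the training script used); "training_flags" names are illustrative.
FINETUNE_SETTINGS = {
    "model_config": {
        "attention_dropout": 0.1,
        "hidden_dropout": 0.1,
        "mask_time_prob": 0.05,
    },
    "training_flags": {
        "gradient_checkpointing": True,    # trades compute for lower memory use
        "freeze_feature_extractor": True,  # convolutional front end stays fixed
    },
}


def describe(settings):
    """Flatten the nested settings into sorted 'group.key=value' lines."""
    lines = []
    for group, values in sorted(settings.items()):
        for key, value in sorted(values.items()):
            lines.append(f"{group}.{key}={value}")
    return "\n".join(lines)


print(describe(FINETUNE_SETTINGS))
```

Keeping model-level and trainer-level options in separate groups makes it clear which values would go into the model configuration and which would be applied to the model or trainer at run time.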

Pre-processes audio by resampling to 16 kHz
Uses mean reduction for the CTC loss
Utilizes PyThaiNLP for word tokenization
Supports both word-level and syllable-level tokenization
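
To illustrate the 16 kHz pre-processing step, here is a minimal pure-Python sketch of resampling by linear interpolation. It is not the method used for this model: the function name resample_linear is hypothetical, and a real pipeline would normally use an audio library with a proper anti-aliasing filter.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal by linear interpolation.

    Simplified illustration only: no anti-aliasing filter is applied, so
    downsampling real audio this way can introduce aliasing.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Blend the two neighbouring source samples.
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# 48 samples recorded at 48 kHz become 16 samples at 16 kHz.
ramp = [float(n) for n in range(48)]
print(len(resample_linear(ramp, 48_000)))
```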

Core Capabilities

Achieves 0.95% WER with PyThaiNLP tokenization
Achieves a 2.81% Character Error Rate (CER)
Handles continuous Thai speech recognition
Supports real-time transcription

Frequently Asked Questions

Q: What makes this model unique?

This model is among the more accurate Thai speech recognition systems available, outperforming traditional approaches and matching or exceeding commercial solutions. Its support for multiple Thai tokenization methods (PyThaiNLP, deepcut) makes it versatile for Thai language processing.

Q: What are the recommended use cases?

The model suits Thai speech transcription tasks, particularly applications that require high accuracy, such as automated subtitling, voice command systems, and general speech-to-text for Thai-language content.
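
A transcription call might look like the sketch below, which uses the Hugging Face transformers automatic-speech-recognition pipeline. The Hub ID in MODEL_ID and the file name in the usage comment are assumptions not stated in this document; the transcribe helper is hypothetical. Running it requires transformers, PyTorch, and ffmpeg for decoding audio files.

```python
# Assumed Hub ID for this model; adjust if the model is hosted elsewhere.
MODEL_ID = "airesearch/wav2vec2-large-xlsr-53-th"


def transcribe(audio_path, model_id=MODEL_ID):
    """Return the transcript of one audio file as a string.

    transformers is imported lazily so this module can be imported
    without the dependency installed.
    """
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # The pipeline decodes the file and resamples it to the rate the
    # model's feature extractor expects.
    return asr(audio_path)["text"]


# Example call (hypothetical file name):
# print(transcribe("thai_sample.wav"))
```

For batch work, the pipeline object would be created once and reused across files rather than rebuilt on every call as it is in this sketch.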