wav2vec2-large-xlsr-53-chinese-zh-cn
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Downloads | 1.85M+ |
| Test WER | 82.37% |
| Test CER | 19.03% |
What is wav2vec2-large-xlsr-53-chinese-zh-cn?
This is a version of Facebook's wav2vec2-large-xlsr-53 model fine-tuned for Chinese (zh-CN) speech recognition. Developed by Jonatas Grosman, it was fine-tuned on the Common Voice 6.1, CSS10, and ST-CMDS datasets and expects speech input sampled at 16 kHz.
Implementation Details
The model uses the Wav2Vec2ForCTC architecture from the Transformers library and character-level tokenization for Chinese text. Input audio must be sampled at 16 kHz.
- Built on the wav2vec2-large-xlsr-53 backbone architecture
- Fine-tuned on the Common Voice 6.1, CSS10, and ST-CMDS datasets
- Implements CTC (Connectionist Temporal Classification) for sequence modeling
- Supports batch processing for efficient inference
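To illustrate what CTC decoding does with the model's per-frame predictions, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the most likely token ID has already been picked for each audio frame. The toy vocabulary and the blank ID of 0 are hypothetical; in Wav2Vec2 tokenizers the padding token serves as the CTC blank, but its actual ID depends on the model's vocabulary file.

```python
from typing import Dict, Sequence


def ctc_greedy_decode(
    frame_ids: Sequence[int],
    id_to_char: Dict[int, str],
    blank_id: int = 0,  # assumption: blank/pad token has ID 0 in this toy vocabulary
) -> str:
    """Collapse consecutive repeated IDs, then drop blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the ID changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical three-token vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "你", 2: "好"}

# One predicted ID per audio frame; repeats and blanks are expected.
frames = [0, 1, 1, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # 你好
```

A blank between two identical IDs is what lets CTC output the same character twice in a row: `[1, 0, 1]` decodes to two characters, while `[1, 1]` collapses to one.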
Core Capabilities
- Direct speech-to-text transcription without a language model
- Character Error Rate (CER) of 19.03% on the test set
- Handles continuous Chinese speech recognition
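- Works with audio decoded from wav or mp3 files (decoding is done by the audio loading library, not the model)
- Expects 16 kHz audio input; audio at other sampling rates should be resampled first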
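The gap between the reported WER (82.37%) and CER (19.03%) is easier to read with the metric definitions in hand. The sketch below computes both with a plain Levenshtein distance. It assumes WER splits on whitespace and CER ignores spaces, which is a common convention but not necessarily the exact evaluation script used for this model; the example sentences are made up. Because written Chinese usually has no spaces between words, a single wrong character can count as an entire wrong "word", which is one plausible reason the WER is so much higher.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one wrong character out of six.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"CER: {cer(reference, hypothesis):.2%}")  # CER: 16.67%
print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 100.00%
```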
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its fine-tuning on Chinese speech data, reaching a 19.03% test CER without requiring a language model. It is also widely used, with over 1.8 million downloads.
Q: What are the recommended use cases?
The model is suited to Chinese speech transcription where the audio is, or can be resampled to, 16 kHz. It can be used for both batch processing and real-time transcription applications, though users should weigh the 19.03% CER against their accuracy requirements.
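A batch transcription workflow might look like the following sketch. It assumes the model is published on the Hugging Face Hub under the ID `jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn` and that `librosa`, `torch`, and `transformers` are installed. The helper names `batched` and `transcribe` are illustrative and not part of any library. The heavy dependencies are imported inside `transcribe` so that the batching helper can be used without them.

```python
from typing import Iterator, List, Sequence

# Assumption: Hub ID built from the author and model name given above.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def transcribe(paths: Sequence[str], batch_size: int = 4) -> List[str]:
    """Transcribe audio files in batches using greedy CTC decoding."""
    # Imported lazily: these packages are large and optional for `batched`.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    transcripts: List[str] = []
    for batch in batched(paths, batch_size):
        # librosa decodes the file and resamples it to 16 kHz on load.
        waves = [librosa.load(path, sr=SAMPLE_RATE)[0] for path in batch]
        inputs = processor(
            waves, sampling_rate=SAMPLE_RATE, return_tensors="pt", padding=True
        )
        with torch.no_grad():
            logits = model(
                inputs.input_values, attention_mask=inputs.attention_mask
            ).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcripts.extend(processor.batch_decode(predicted_ids))
    return transcripts
```

Calling `transcribe(["clip1.wav", "clip2.mp3"])` with real file paths would download the model on first use and return one transcript per file. Larger batch sizes trade memory for throughput, since every clip in a batch is padded to the length of the longest one.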