wav2vec2-large-chinese-zh-cn
Property | Value |
---|---|
Author | wbbbbb |
Base Model | facebook/wav2vec2-large-xlsr-53 |
Training Data | Common Voice 6.1, CSS10, ST-CMDS |
Performance | CER: 12.30%, WER: 70.47% |
Hardware Used | RTX 3090 (50 hours of training) |
What is wav2vec2-large-chinese-zh-cn?
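This is a speech recognition model for Mandarin Chinese, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint. It reports a Character Error Rate (CER) of 12.30% and a Word Error Rate (WER) of 70.47%, and is presented as an improvement over comparable Chinese ASR models, although the comparison models and the evaluation set are not named here. CER is the more informative of the two metrics for Chinese: written Chinese does not separate words with spaces, so WER depends heavily on how the text is segmented.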
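To make the CER figure concrete, here is a minimal sketch of how the metric is usually computed: the character-level edit distance between the reference and the model output, divided by the reference length. The function names and the sample sentences are illustrative assumptions, not the evaluation script behind the reported 12.30%.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing
                curr[j - 1] + 1,     # insertion: extra hypothesis character
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong character out of six gives a CER of 1/6.
print(cer("今天天气很好", "今天天汽很好"))
```

Because the distance is counted per character, no word segmentation is needed, which is why CER is the usual headline metric for Chinese ASR.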
Implementation Details
The model was fine-tuned on Chinese speech from Common Voice 6.1, CSS10, and ST-CMDS. It expects audio sampled at 16 kHz and can be used for speech recognition through the HuggingSound library.
- Built on the wav2vec2-large-xlsr-53 architecture
- Fine-tuned for Mandarin Chinese recognition
- Trained for 50 hours on an RTX 3090 GPU
- Usable directly through the HuggingSound library
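A minimal sketch of transcription through HuggingSound follows. It assumes the model is published on the Hugging Face Hub under the id `wbbbbb/wav2vec2-large-chinese-zh-cn` (author plus model name, an assumption), that the `huggingsound` package is installed, and that each result is a dictionary with a `transcription` key. The helper name `transcribe_files` and the optional `model` argument are hypothetical conveniences, not part of the library.

```python
MODEL_ID = "wbbbbb/wav2vec2-large-chinese-zh-cn"  # assumed Hub id


def transcribe_files(audio_paths, model=None, batch_size=1):
    """Return one transcription string per audio file.

    `model` may be any object with a HuggingSound-style `transcribe` method;
    when omitted, the model is loaded from the Hub on first use.
    """
    if model is None:
        # Imported lazily so this module can be loaded without the package.
        from huggingsound import SpeechRecognitionModel
        model = SpeechRecognitionModel(MODEL_ID)
    results = model.transcribe(audio_paths, batch_size=batch_size)
    return [result["transcription"] for result in results]
```

A call such as `transcribe_files(["/path/to/file.mp3", "/path/to/another_file.wav"])` downloads the model on first use; the paths are placeholders.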
Core Capabilities
- Mandarin Chinese speech recognition with a reported CER of 12.30%
- Direct transcription without a separate language model
- Batch processing support
- GPU inference
- Support for multiple audio file formats as input
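Transcription without a language model usually means greedy decoding of the network output. The sketch below assumes the model uses a CTC output head, as wav2vec2 fine-tunes for speech recognition typically do: take the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. The vocabulary, the blank id, and the frame sequence are made up for illustration; the real tokenizer defines its own.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical vocabulary and per-frame argmax ids.
vocab = {0: "<pad>", 1: "你", 2: "好"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frames, vocab))  # 你好
```

The blank token is what lets the decoder keep genuinely repeated characters: `[1, 0, 1]` decodes to two characters, while `[1, 1]` collapses to one.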
Frequently Asked Questions
Q: What makes this model unique?
The model's main distinguishing point is its reported 12.30% CER on Mandarin Chinese, which is presented as better than other publicly available models, though those models are not named here. It was fine-tuned for 50 hours on Common Voice 6.1, CSS10, and ST-CMDS, and it transcribes directly without a separate language model.
Q: What are the recommended use cases?
The model suits Chinese speech transcription tasks where a separate language model is not wanted. It can be used for batch processing and, where inference is fast enough, for real-time transcription, provided the audio input is sampled at 16 kHz.
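Since the model expects 16 kHz audio, it can help to check the sample rate of a file before transcribing it. The sketch below uses only Python's standard `wave` module, so it covers uncompressed PCM WAV files only; other formats need an audio library. The helper names are hypothetical.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, in Hz


def wav_sample_rate(path):
    """Read the sample rate from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file flagged by `needs_resampling` should be converted to 16 kHz before its samples are passed to the model directly.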