wav2vec2-large-xlsr-53-chinese-zh-cn-gpt

ydshieh

Fine-tuned Wav2Vec2 model for Mandarin Chinese ASR, trained on the Common Voice zh-CN and zh-TW datasets, achieving a 20.90% character error rate (CER) on the test set. Expects 16 kHz audio input.

Property      Value
Base Model    facebook/wav2vec2-large-xlsr-53
Author        ydshieh
Model Hub     HuggingFace
Test CER      20.90%

What is wav2vec2-large-xlsr-53-chinese-zh-cn-gpt?

This is a speech recognition model for Mandarin Chinese (zh-CN), fine-tuned from Facebook's pretrained wav2vec2-large-xlsr-53 checkpoint. It was trained on both the mainland Chinese (zh-CN) and Taiwanese (zh-TW) Common Voice datasets, with all text labels converted to simplified Chinese.

Implementation Details

The model operates on 16kHz audio input and employs the Wav2Vec2 architecture with CTC (Connectionist Temporal Classification) for speech recognition. It includes preprocessing steps for audio resampling and text normalization, particularly handling Chinese-specific characters and punctuation.
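To make the CTC step concrete, here is a minimal sketch of greedy CTC decoding in plain Python: repeated frame predictions are collapsed, then blank tokens are dropped. The toy vocabulary and the choice of `0` as the blank id are assumptions for illustration only; the real values come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks; a blank between two equal ids keeps a genuine repeat.
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real model's vocabulary is far larger.
vocab = {1: "你", 2: "好"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]  # per-frame argmax ids (assumed)
print(ctc_greedy_decode(frames, vocab))  # 你好
```

Because each frame is decoded independently from the acoustic model's output, this style of decoding needs no separate language model.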

  • Requires 16kHz audio input sampling rate
  • Implemented using the Transformers library and PyTorch
  • Filters punctuation and other unwanted characters from Chinese text
  • Supports batch processing for efficient inference
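The character filtering mentioned above might look roughly like the following sketch. The punctuation set and the `normalize_text` helper are hypothetical, since the card does not list the exact characters the model's preprocessing removes. Conversion from traditional to simplified characters is not shown, because it needs a conversion table or an external library.

```python
import re

# Hypothetical punctuation set (ASCII plus common full-width Chinese marks).
# The model's actual filter list is not given in this card and may differ.
PUNCTUATION = ",.!?;:\"'()-,。!?;:、“”‘’「」『』《》()…"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def normalize_text(sentence: str) -> str:
    """Remove punctuation and collapse whitespace in a transcript."""
    without_punct = _PUNCT_RE.sub("", sentence)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(without_punct.split())


print(normalize_text("你好,世界!"))  # 你好世界
```

Applying the same normalization to both reference transcripts and model output keeps punctuation differences from inflating the error rate.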

Core Capabilities

  • Direct speech-to-text transcription without requiring a language model
  • Handles both mainland Chinese and Taiwanese speech patterns
  • Achieves 20.90% Character Error Rate (CER) on test data
  • Supports batch processing for multiple audio inputs
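For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the reference and the model output, divided by the number of reference characters. The sketch below uses that common definition with made-up sentences; the evaluation script behind the reported 20.90% is not shown in this card, so treat this as an illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Made-up example: one wrong character out of eight reference characters.
refs = ["今天天气很好", "你好"]
hyps = ["今天天汽很好", "你好"]
print(f"{cer(refs, hyps):.2%}")  # 12.50%
```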

Frequently Asked Questions

Q: What makes this model unique?

The model is fine-tuned specifically for Mandarin speech recognition on both mainland Chinese (zh-CN) and Taiwanese (zh-TW) Common Voice data, and it always outputs simplified Chinese text. Its preprocessing filters punctuation and other unwanted characters from the transcripts.

Q: What are the recommended use cases?

The model suits applications that need Mandarin Chinese speech recognition, such as transcription services, voice assistants, and audio content analysis. Because it was trained on both the zh-CN and zh-TW datasets, it is particularly useful when the audio mixes mainland and Taiwanese Mandarin speakers.
