Whisper Large Chinese (Mandarin)
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | openai/whisper-large-v2 |
| Training Data | Common Voice 11 |
| Primary Task | Automatic Speech Recognition |
What is whisper-large-zh-cv11?
Whisper-large-zh-cv11 is a speech recognition model fine-tuned from OpenAI's Whisper Large v2 specifically for Mandarin Chinese. Developed by Jonatas Grosman, it improves substantially on the base model, achieving a Character Error Rate (CER) of 9.55% on the Common Voice 11 test set, compared to 29.90% for the original model.
Implementation Details
The model was trained on the combined training and validation splits of Common Voice 11, with 1,000 samples held out for evaluation during fine-tuning. It is built on the Transformer architecture, runs on PyTorch, and integrates directly with the Hugging Face transformers library (see the usage sketch after the list below).
- Supports both raw and normalized text transcription
- Handles casing and punctuation
- Optimized for Mandarin Chinese speech recognition
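The snippet below is a minimal transcription sketch using the transformers ASR pipeline. The model id `jonatasgrosman/whisper-large-zh-cv11` and the audio path are illustrative assumptions; adjust them to the actual Hub repository and your own 16 kHz audio file.

```python
import torch
from transformers import pipeline

# Assumed Hub model id; replace with the actual repository if it differs.
MODEL_ID = "jonatasgrosman/whisper-large-zh-cv11"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_ID,
    device=device,
)

# Force Mandarin transcription so the decoder does not auto-detect the language.
asr.model.config.forced_decoder_ids = asr.tokenizer.get_decoder_prompt_ids(
    language="zh", task="transcribe"
)

# "sample.wav" is a placeholder path; chunking handles audio longer than 30 s.
result = asr("sample.wav", chunk_length_s=30)
print(result["text"])
```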
Core Capabilities
- Achieves 9.55% CER and 55.02% WER on Common Voice 11
- Performs well on out-of-domain data (11.76% CER on Fleurs dataset)
- Supports specialized handling of numerical transcriptions
- Includes language and task-specific decoder prompts
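For reference, the sketch below shows one way to compute CER with the Hugging Face `evaluate` library. The transcripts are placeholders, not the actual Common Voice 11 evaluation data that produced the figures above.

```python
import evaluate

# Character Error Rate metric (backed by jiwer); WER loads the same way
# with evaluate.load("wer").
cer_metric = evaluate.load("cer")

# Placeholder transcripts for illustration only; in practice these come from
# running the model over the Common Voice 11 Chinese test split.
predictions = ["今天天气很好", "我们去公园散步"]
references = ["今天天气很好", "我们去公园散步吧"]

print("CER:", cer_metric.compute(predictions=predictions, references=references))
```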
Frequently Asked Questions
Q: What makes this model unique?
This model significantly outperforms the base Whisper Large v2 on Mandarin Chinese, reducing CER by over 20 percentage points on Common Voice 11. It's specifically optimized for Chinese speech recognition while maintaining the ability to handle different text normalization scenarios.
Q: What are the recommended use cases?
The model is ideal for Mandarin Chinese speech transcription tasks, particularly when high character-level accuracy is required. It's suitable for both general transcription and scenarios requiring normalized text output, though users should be aware of potential limitations with numerical value transcriptions.
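As a sketch of producing normalized output, the transformers library ships a `BasicTextNormalizer` that lowercases text and strips punctuation; whether this matches the normalization used for the reported scores is an assumption.

```python
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# Basic normalization (lowercasing, punctuation removal). The exact scheme
# behind this model card's "normalized" results may differ.
normalizer = BasicTextNormalizer()

raw_transcript = "今天天气很好，我们去公园散步吧！"
print(normalizer(raw_transcript))
```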