# Whisper Small Cantonese
| Property | Value |
|---|---|
| Parameter Count | 242M |
| License | Apache 2.0 |
| Paper | Research Paper |
| Model Type | Automatic Speech Recognition |
| CER Score | 7.93% (without punctuation) |
## What is whisper-small-cantonese?
Whisper-small-cantonese is a specialized speech recognition model fine-tuned from OpenAI's Whisper-small checkpoint specifically for Cantonese. Trained on over 934 hours of diverse data, including Common Voice, CantoMap, and YouTube content, it represents a significant advancement in Cantonese ASR.
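The sketch below shows one minimal way to transcribe audio with the Hugging Face `transformers` pipeline. It assumes the checkpoint is published on the Hub under an ID such as `alvanlii/whisper-small-cantonese` (substitute the actual repository ID) and that `ffmpeg` is available for audio decoding.

```python
# Minimal transcription sketch using the transformers ASR pipeline.
# The model ID below is an assumption; replace it with the actual Hub repository ID.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="alvanlii/whisper-small-cantonese",  # assumed Hub ID
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
    device=device,
)

# Whisper models expect 16 kHz audio; the pipeline resamples input files via ffmpeg.
result = asr("sample_cantonese.wav")
print(result["text"])
```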
## Implementation Details
The model utilizes a transformer-based architecture with several optimizations for performance. It supports both standard and Flash Attention implementations, with the latter reducing inference time from 0.308s to 0.055s per sample on GPU (see the loading sketch after the list below).
- GPU VRAM Usage: ~1.5GB
- Supports speculative decoding for faster processing
- Compatible with Whisper.cpp, and with WhisperX/FasterWhisper via CTranslate2 (CT2) conversion
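The following is a loading sketch for the Flash Attention path, assuming a recent `transformers` release, an installed `flash-attn` package, and a compatible (Ampere or newer) GPU; the model ID is again an assumption.

```python
# Sketch: load the checkpoint with Flash Attention 2 for faster GPU inference.
# Requires the flash-attn package; the model ID is assumed.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "alvanlii/whisper-small-cantonese"  # assumed Hub ID

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device="cuda:0",  # Flash Attention requires a CUDA device
)

print(asr("sample_cantonese.wav")["text"])
```

If `flash-attn` is not installed, dropping the `attn_implementation` argument falls back to the standard attention path.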
## Core Capabilities
- Fast inference with Flash Attention support
- Excellent accuracy: 7.93% character error rate (CER) without punctuation
- Efficient processing of long-form audio (see the chunked-inference sketch after this list)
- Flexible deployment options (CPU/GPU)
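For long recordings, the `transformers` pipeline can split audio into chunks and batch them, which is one way to realize the long-form and CPU/GPU flexibility listed above. This is a sketch under the same assumed model ID; the chunk length and batch size are illustrative defaults, not values prescribed by this model card.

```python
# Sketch: chunked long-form transcription on CPU or GPU.
# chunk_length_s splits the audio into 30-second windows; batch_size controls
# how many chunks are decoded together. The model ID is assumed.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="alvanlii/whisper-small-cantonese",  # assumed Hub ID
    chunk_length_s=30,
    batch_size=8,  # reduce on CPU or low-memory GPUs
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
    device=device,
)

result = asr("long_cantonese_recording.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```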
## Frequently Asked Questions

### Q: What makes this model unique?
This model stands out for its specialized optimization for Cantonese, extensive training data including pseudo-labeled content, and excellent balance of speed and accuracy. It achieves state-of-the-art performance while maintaining reasonable resource requirements.
### Q: What are the recommended use cases?
The model is ideal for Cantonese speech transcription tasks, particularly in applications requiring real-time or near-real-time processing. It's suitable for both production environments and research applications, especially when dealing with varied Cantonese dialects and accents.