Kotoba-Whisper v2.0
| Property | Value |
|---|---|
| Parameter Count | 756M |
| License | Apache 2.0 |
| Language | Japanese |
| Paper | Link |
| Tensor Type | BF16 |
What is kotoba-whisper-v2.0?
Kotoba-Whisper v2.0 is a specialized Japanese Automatic Speech Recognition (ASR) model developed through collaboration between Asahi Ushio and Kotoba Technologies. It's a distilled version of OpenAI's Whisper large-v3, offering 6.3x faster performance while maintaining comparable accuracy. The model was trained on over 7.2 million audio clips from the ReazonSpeech dataset.
Implementation Details
The model pairs the full encoder of Whisper large-v3 with a lightweight decoder of only two layers. It's optimized for 16kHz audio and supports both short-form and long-form transcription tasks.
- Achieves better CER and WER than Whisper large-v3 on in-domain tests
- Supports Flash Attention 2 for improved performance
- Includes timestamps and chunked processing capabilities
- Compatible with both sequential and chunked long-form transcription
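A minimal short-form usage sketch with the Hugging Face Transformers pipeline (the model id `kotoba-tech/kotoba-whisper-v2.0` and the audio filename are assumptions; verify against the model hub):

```python
import torch
from transformers import pipeline

# Assumed model id on the Hugging Face Hub; verify before use.
model_id = "kotoba-tech/kotoba-whisper-v2.0"

# BF16 on GPU matches the model's tensor type; fall back to FP32 on CPU.
use_cuda = torch.cuda.is_available()
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    device="cuda:0" if use_cuda else "cpu",
)

# Short-form transcription; file input is resampled to 16 kHz by the pipeline.
result = pipe("sample.wav", return_timestamps=True)  # illustrative filename
print(result["text"])
```

When the `flash-attn` package is installed, Flash Attention 2 can be requested by passing `model_kwargs={"attn_implementation": "flash_attention_2"}` to `pipeline`.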
Core Capabilities
- Fast transcription: 6.3x faster than Whisper large-v3
- High accuracy: 9.2% CER on CommonVoice 8.0
- Efficient processing: Supports batch processing and GPU acceleration
- Flexible deployment: Compatible with Transformers pipeline
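The CER figure above is the standard character error rate: character-level edit distance divided by reference length. A small self-contained sketch (function names and the toy transcript pair are illustrative, not from the model's evaluation code):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# Toy example: one substituted character out of nine -> CER of 1/9.
print(f"CER: {cer('今日は良い天気です', '今日はいい天気です'):.3f}")  # CER: 0.111
```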
Frequently Asked Questions
Q: What makes this model unique?
The model combines speed optimization with high accuracy specifically for Japanese ASR, making it particularly efficient for production environments while maintaining competitive error rates.
Q: What are the recommended use cases?
It's ideal for Japanese speech transcription tasks, particularly in production environments where both speed and accuracy are crucial. It can handle both short-form (under 30s) and long-form audio with various optimization options.
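For long-form audio, the Transformers pipeline's chunked mode splits the input into fixed windows and decodes them in parallel. A hedged sketch (the model id, filename, chunk length, and batch size are assumptions to adjust for your hardware):

```python
import torch
from transformers import pipeline

# Assumed model id on the Hugging Face Hub; verify before use.
use_cuda = torch.cuda.is_available()
pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v2.0",
    torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    device="cuda:0" if use_cuda else "cpu",
)

# chunk_length_s switches the pipeline into chunked long-form mode;
# batch_size controls how many chunks are decoded in parallel.
result = pipe(
    "long_interview.wav",  # illustrative filename
    chunk_length_s=15,
    batch_size=16,
    return_timestamps=True,
)
print(result["text"])
```

Omitting `chunk_length_s` instead uses sequential long-form decoding, which is slower but can be more accurate at chunk boundaries.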