Kotoba-Whisper v2.0
| Property | Value |
|---|---|
| Parameter Count | 756M |
| License | Apache 2.0 |
| Language | Japanese |
| Paper | Link |
| Tensor Type | BF16 |
What is kotoba-whisper-v2.0?
Kotoba-Whisper v2.0 is a specialized Japanese Automatic Speech Recognition (ASR) model developed through collaboration between Asahi Ushio and Kotoba Technologies. It's a distilled version of OpenAI's Whisper large-v3, offering 6.3x faster performance while maintaining comparable accuracy. The model was trained on over 7.2 million audio clips from the ReazonSpeech dataset.
Implementation Details
The model pairs the full encoder of Whisper large-v3 with a lightweight decoder of only two layers. It's optimized for 16kHz audio and supports both short-form and long-form transcription tasks.
- Achieves better CER and WER than Whisper large-v3 on in-domain tests
- Supports Flash Attention 2 for improved performance
- Includes timestamps and chunked processing capabilities
- Compatible with both sequential and chunked long-form transcription
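A minimal short-form usage sketch with the Hugging Face Transformers pipeline (the model id `kotoba-tech/kotoba-whisper-v2.0` and the audio filename are assumptions; verify against the model hub):

```python
import torch
from transformers import pipeline

# Assumed model id on the Hugging Face Hub; verify before use.
model_id = "kotoba-tech/kotoba-whisper-v2.0"

# BF16 on GPU matches the model's tensor type; fall back to FP32 on CPU.
use_cuda = torch.cuda.is_available()
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    device="cuda:0" if use_cuda else "cpu",
)

# Short-form transcription; file input is resampled to 16 kHz by the pipeline.
result = pipe("sample.wav", return_timestamps=True)  # illustrative filename
print(result["text"])
```

When the `flash-attn` package is installed, Flash Attention 2 can be requested by passing `model_kwargs={"attn_implementation": "flash_attention_2"}` to `pipeline`.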
Core Capabilities
- Fast transcription: 6.3x faster than Whisper large-v3
- High accuracy: 9.2% CER on CommonVoice 8.0
- Efficient processing: Supports batch processing and GPU acceleration
- Flexible deployment: Compatible with Transformers pipeline
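The CER figure above is the standard character error rate: character-level edit distance divided by reference length. A small self-contained sketch (function names and the toy transcript pair are illustrative, not from the model's evaluation code):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# Toy example: one substituted character out of nine -> CER of 1/9.
print(f"CER: {cer('今日は良い天気です', '今日はいい天気です'):.3f}")  # CER: 0.111
```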
Frequently Asked Questions
Q: What makes this model unique?
The model combines speed optimization with high accuracy specifically for Japanese ASR, making it particularly efficient for production environments while maintaining competitive error rates.
Q: What are the recommended use cases?
It's ideal for Japanese speech transcription tasks, particularly in production environments where both speed and accuracy are crucial. It can handle both short-form (under 30s) and long-form audio with various optimization options.
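For long-form audio, the Transformers pipeline's chunked mode splits the input into fixed windows and decodes them in parallel. A hedged sketch (the model id, filename, chunk length, and batch size are assumptions to adjust for your hardware):

```python
import torch
from transformers import pipeline

# Assumed model id on the Hugging Face Hub; verify before use.
use_cuda = torch.cuda.is_available()
pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v2.0",
    torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    device="cuda:0" if use_cuda else "cpu",
)

# chunk_length_s switches the pipeline into chunked long-form mode;
# batch_size controls how many chunks are decoded in parallel.
result = pipe(
    "long_interview.wav",  # illustrative filename
    chunk_length_s=15,
    batch_size=16,
    return_timestamps=True,
)
print(result["text"])
```

Omitting `chunk_length_s` instead uses sequential long-form decoding, which is slower but can be more accurate at chunk boundaries.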