kotoba-whisper-v2.0

kotoba-tech

A 756M-parameter Japanese speech recognition model based on Whisper, 6.3x faster than large-v3 with competitive accuracy on Japanese ASR tasks.

  • Parameter Count: 756M
  • License: Apache 2.0
  • Language: Japanese
  • Paper: Knowledge Distillation via Large-Scale Pseudo Labelling

What is kotoba-whisper-v2.0?

Kotoba-Whisper v2.0 is a specialized Japanese speech recognition model developed through collaboration between Asahi Ushio and Kotoba Technologies. It's a distilled version of OpenAI's Whisper large-v3, designed specifically for Japanese ASR tasks, offering 6.3x faster performance while maintaining competitive accuracy.

Implementation Details

The model pairs the full encoder from Whisper large-v3 with a streamlined two-layer decoder. It was trained on the ReazonSpeech dataset, comprising over 7.2 million audio clips, each averaging 5 seconds with 18 text tokens. Training ran for 8 epochs with a batch size of 256 at a 16 kHz sampling rate.

  • Architecture: Modified Whisper with full encoder and reduced decoder
  • Training Data: ReazonSpeech dataset with WER filtering
  • Performance: 6.3x faster than Whisper large-v3
  • Accuracy: Better CER/WER on in-domain tests compared to large-v3
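The WER filtering mentioned above can be sketched as a simple quality gate on pseudo-labels: a clip is kept for training only if the teacher's transcript is close enough to the reference. The 10% threshold and the character-level scoring (a natural choice for Japanese, which is not whitespace-tokenized) are illustrative assumptions, not documented settings.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length.
    Character-level scoring is an illustrative choice for Japanese text."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def keep_clip(reference: str, pseudo_label: str, threshold: float = 0.1) -> bool:
    """Keep a training clip only if the pseudo-label is close enough to the
    reference transcript. The 10% threshold is an assumed value."""
    return cer(reference, pseudo_label) <= threshold
```

Filtering of this kind discards clips where the teacher's pseudo-label disagrees badly with the reference, so the distilled student is not trained on noisy targets.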

Core Capabilities

  • Efficient Japanese speech recognition with reduced latency
  • Support for both short-form and long-form transcription
  • Flash Attention 2 compatibility for improved performance
  • Segment-level timestamp generation
  • Batch processing support for long audio files
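Chunked long-form processing works by splitting the audio into overlapping windows, decoding each window (optionally in batches), and merging the overlapping regions. A simplified sketch of the window arithmetic is below; the 15-second chunk and 2.5-second stride are illustrative values, not documented model settings.

```python
def chunk_windows(duration_s: float, chunk_length_s: float = 15.0,
                  stride_s: float = 2.5) -> list[tuple[float, float]]:
    """Return (start, end) times of overlapping windows covering the audio.
    Consecutive windows overlap by `stride_s` on each side, so the step
    between window starts is chunk_length_s - 2 * stride_s."""
    step = chunk_length_s - 2 * stride_s
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_length_s, duration_s)))
        if start + chunk_length_s >= duration_s:
            break
        start += step
    return windows
```

For a 30-second file this yields windows (0, 15), (10, 25), (20, 30): each chunk can be decoded independently, which is what makes batched long-form transcription possible, at some accuracy cost near chunk boundaries compared to sequential decoding.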

Frequently Asked Questions

Q: What makes this model unique?

The model combines the accuracy of Whisper large-v3 with significantly improved speed (6.3x faster) while specifically optimizing for Japanese language processing. It achieves this through careful architectural choices and specialized training on Japanese speech data.

Q: What are the recommended use cases?

The model is ideal for Japanese speech recognition tasks, particularly when processing speed is crucial. It's suitable for both short-form (< 30 seconds) and long-form audio transcription, with options for both sequential and chunked processing depending on accuracy vs. speed requirements.
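A minimal usage sketch with the Hugging Face transformers ASR pipeline is shown below. The Hub id, chunk length, and batch size are assumptions (check the model's own page for canonical values); the pipeline is only constructed inside a helper so no weights are downloaded until it is called.

```python
import torch
from transformers import pipeline

# Assumed Hub id for this model; verify against the official model page.
MODEL_ID = "kotoba-tech/kotoba-whisper-v2.0"
# Whisper-style generation arguments: transcribe (not translate) Japanese.
GENERATE_KWARGS = {"language": "ja", "task": "transcribe"}

def build_asr_pipeline(long_form: bool = False):
    """Short-form (< 30 s): a plain pipeline decodes the clip sequentially.
    Long-form: chunked processing (chunk_length_s/batch_size are assumed
    values) trades some boundary accuracy for much higher throughput."""
    extra = {"chunk_length_s": 15, "batch_size": 16} if long_form else {}
    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        **extra,
    )

# Example call (commented out: requires downloading the model weights):
# asr = build_asr_pipeline(long_form=True)
# result = asr("meeting.wav", return_timestamps=True,
#              generate_kwargs=GENERATE_KWARGS)
# print(result["text"])
```

Passing `return_timestamps=True` requests the segment-level timestamps the model supports; choosing `long_form=True` selects chunked over sequential processing when speed matters more than boundary-level accuracy.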
