kotoba-whisper-v2.0

kotoba-tech

A 756M-parameter Japanese speech recognition model based on Whisper, 6.3x faster than large-v3 with competitive accuracy on Japanese ASR tasks.

  • Parameter Count: 756M
  • License: Apache 2.0
  • Language: Japanese
  • Paper: Knowledge Distillation via Large-Scale Pseudo Labelling

What is kotoba-whisper-v2.0?

Kotoba-Whisper v2.0 is a specialized Japanese speech recognition model developed through collaboration between Asahi Ushio and Kotoba Technologies. It's a distilled version of OpenAI's Whisper large-v3, designed specifically for Japanese ASR tasks, offering 6.3x faster performance while maintaining competitive accuracy.

Implementation Details

The model pairs the full encoder from Whisper large-v3 with a streamlined two-layer decoder. It was trained on the ReazonSpeech dataset, comprising over 7.2 million audio clips, each averaging 5 seconds with 18 text tokens. Training ran for 8 epochs with a batch size of 256 at a 16 kHz sampling rate.

  • Architecture: Modified Whisper with full encoder and reduced decoder
  • Training Data: ReazonSpeech dataset with WER filtering
  • Performance: 6.3x faster than Whisper large-v3
  • Accuracy: Better CER/WER on in-domain tests compared to large-v3
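The WER filtering mentioned above can be sketched as a simple quality gate on pseudo-labels: a clip is kept for training only if the teacher's transcript is close enough to the reference. The 10% threshold and the character-level scoring (a natural choice for Japanese, which is not whitespace-tokenized) are illustrative assumptions, not documented settings.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length.
    Character-level scoring is an illustrative choice for Japanese text."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def keep_clip(reference: str, pseudo_label: str, threshold: float = 0.1) -> bool:
    """Keep a training clip only if the pseudo-label is close enough to the
    reference transcript. The 10% threshold is an assumed value."""
    return cer(reference, pseudo_label) <= threshold
```

Filtering of this kind discards clips where the teacher's pseudo-label disagrees badly with the reference, so the distilled student is not trained on noisy targets.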

Core Capabilities

  • Efficient Japanese speech recognition with reduced latency
  • Support for both short-form and long-form transcription
  • Flash Attention 2 compatibility for improved performance
  • Segment-level timestamp generation
  • Batch processing support for long audio files
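Chunked long-form processing works by splitting the audio into overlapping windows, decoding each window (optionally in batches), and merging the overlapping regions. A simplified sketch of the window arithmetic is below; the 15-second chunk and 2.5-second stride are illustrative values, not documented model settings.

```python
def chunk_windows(duration_s: float, chunk_length_s: float = 15.0,
                  stride_s: float = 2.5) -> list[tuple[float, float]]:
    """Return (start, end) times of overlapping windows covering the audio.
    Consecutive windows overlap by `stride_s` on each side, so the step
    between window starts is chunk_length_s - 2 * stride_s."""
    step = chunk_length_s - 2 * stride_s
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_length_s, duration_s)))
        if start + chunk_length_s >= duration_s:
            break
        start += step
    return windows
```

For a 30-second file this yields windows (0, 15), (10, 25), (20, 30): each chunk can be decoded independently, which is what makes batched long-form transcription possible, at some accuracy cost near chunk boundaries compared to sequential decoding.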

Frequently Asked Questions

Q: What makes this model unique?

The model combines the accuracy of Whisper large-v3 with significantly improved speed (6.3x faster) while specifically optimizing for Japanese language processing. It achieves this through careful architectural choices and specialized training on Japanese speech data.

Q: What are the recommended use cases?

The model is ideal for Japanese speech recognition tasks, particularly when processing speed is crucial. It's suitable for both short-form (< 30 seconds) and long-form audio transcription, with options for both sequential and chunked processing depending on accuracy vs. speed requirements.
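A minimal usage sketch with the Hugging Face transformers ASR pipeline is shown below. The Hub id, chunk length, and batch size are assumptions (check the model's own page for canonical values); the pipeline is only constructed inside a helper so no weights are downloaded until it is called.

```python
import torch
from transformers import pipeline

# Assumed Hub id for this model; verify against the official model page.
MODEL_ID = "kotoba-tech/kotoba-whisper-v2.0"
# Whisper-style generation arguments: transcribe (not translate) Japanese.
GENERATE_KWARGS = {"language": "ja", "task": "transcribe"}

def build_asr_pipeline(long_form: bool = False):
    """Short-form (< 30 s): a plain pipeline decodes the clip sequentially.
    Long-form: chunked processing (chunk_length_s/batch_size are assumed
    values) trades some boundary accuracy for much higher throughput."""
    extra = {"chunk_length_s": 15, "batch_size": 16} if long_form else {}
    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        **extra,
    )

# Example call (commented out: requires downloading the model weights):
# asr = build_asr_pipeline(long_form=True)
# result = asr("meeting.wav", return_timestamps=True,
#              generate_kwargs=GENERATE_KWARGS)
# print(result["text"])
```

Passing `return_timestamps=True` requests the segment-level timestamps the model supports; choosing `long_form=True` selects chunked over sequential processing when speed matters more than boundary-level accuracy.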
