kotoba-whisper-v2.0

Maintained by: kotoba-tech

Kotoba-Whisper v2.0

Property          Value
Parameter Count   756M
License           Apache 2.0
Language          Japanese
Paper             Knowledge Distillation via Large-Scale Pseudo Labelling

What is kotoba-whisper-v2.0?

Kotoba-Whisper v2.0 is a Japanese automatic speech recognition (ASR) model developed jointly by Asahi Ushio and Kotoba Technologies. It is a distilled version of OpenAI's Whisper large-v3 built specifically for Japanese, and it runs 6.3x faster than large-v3 while keeping accuracy competitive.

Implementation Details

The model keeps the full encoder from Whisper large-v3 and pairs it with a reduced two-layer decoder. It was trained on the ReazonSpeech dataset, which comprises over 7.2 million audio clips averaging 5 seconds of audio and 18 text tokens each. Training ran for 8 epochs with a batch size of 256 on audio sampled at 16 kHz.
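To give a sense of scale, the figures above can be turned into rough totals. This is a back-of-the-envelope sketch, not a reported statistic: it treats "over 7.2 million" as exactly 7.2 million, assumes the batch size of 256 is the global batch size, and ignores any clips dropped by filtering.

```python
# Rough scale estimates derived from the stated training figures.
# Assumptions (not from the model card): exactly 7.2M clips, a global
# batch size of 256, and no clips removed before training.
import math

NUM_CLIPS = 7_200_000      # "over 7.2 million audio clips"
AVG_CLIP_SECONDS = 5       # average clip length
AVG_TEXT_TOKENS = 18       # average transcript length in tokens
BATCH_SIZE = 256
EPOCHS = 8

total_hours = NUM_CLIPS * AVG_CLIP_SECONDS / 3600
total_tokens = NUM_CLIPS * AVG_TEXT_TOKENS
steps_per_epoch = math.ceil(NUM_CLIPS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"audio: about {total_hours:,.0f} hours")        # about 10,000 hours
print(f"text: about {total_tokens:,} tokens")          # about 129,600,000 tokens
print(f"steps per epoch: about {steps_per_epoch:,}")   # about 28,125
print(f"total steps: about {total_steps:,}")           # about 225,000
```

Because the clip count is a lower bound and the averages are rounded, these numbers are order-of-magnitude guides only.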

  • Architecture: Modified Whisper with full encoder and reduced decoder
  • Training Data: ReazonSpeech dataset with WER filtering
  • Performance: 6.3x faster than Whisper large-v3
  • Accuracy: Lower CER/WER than large-v3 on in-domain test sets
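For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between a reference transcript and the model output, divided by the reference length. The sketch below is a minimal, dependency-free illustration; the function names are hypothetical and this is not the evaluation script used for the reported results, which would typically also normalize text (for example, punctuation) before scoring.

```python
# Minimal CER sketch using Levenshtein distance over characters.
# Function names are illustrative, not part of any library.

def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two of five characters differ, so the CER is 2 / 5.
print(cer("こんにちは", "こんばんは"))  # 0.4
```

Since written Japanese has no spaces between words, CER avoids the tokenization step that WER depends on.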

Core Capabilities

  • Efficient Japanese speech recognition with reduced latency
  • Support for both short-form and long-form transcription
  • Flash Attention 2 compatibility for faster inference on supported GPUs
  • Segment-level timestamp generation
  • Batch processing support for long audio files
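The capabilities above can be exercised through the Hugging Face transformers pipeline. The following is a hedged sketch rather than official usage: the Hub id `kotoba-tech/kotoba-whisper-v2.0` is inferred from the maintainer and model name on this page, the helper names are hypothetical, and Flash Attention 2 additionally requires the separate `flash-attn` package and a compatible GPU.

```python
# Sketch: transcribing Japanese audio with segment-level timestamps.
# MODEL_ID is assumed from the maintainer and model name; helper
# functions are illustrative and not part of any library.
from typing import Any, Dict

MODEL_ID = "kotoba-tech/kotoba-whisper-v2.0"


def build_model_kwargs(use_cuda: bool, use_flash_attention: bool = False) -> Dict[str, Any]:
    """Pick an attention implementation; leave the default on CPU."""
    if not use_cuda:
        return {}
    impl = "flash_attention_2" if use_flash_attention else "sdpa"
    return {"attn_implementation": impl}


def transcribe(audio_path: str, use_flash_attention: bool = False) -> Dict[str, Any]:
    """Return a dict with the full text and timestamped segments."""
    # Imported lazily so the helper above works without torch installed.
    import torch
    from transformers import pipeline

    use_cuda = torch.cuda.is_available()
    pipe = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
        device="cuda:0" if use_cuda else "cpu",
        model_kwargs=build_model_kwargs(use_cuda, use_flash_attention),
    )
    return pipe(
        audio_path,
        return_timestamps=True,  # segment-level timestamps
        generate_kwargs={"language": "ja", "task": "transcribe"},
    )

# Example call (downloads the model on first use):
#   result = transcribe("sample.wav")
#   result["text"] holds the transcript; result["chunks"] holds segments
#   with (start, end) timestamps.
```

Here `sample.wav` is a placeholder path, not a file shipped with the model.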

Frequently Asked Questions

Q: What makes this model unique?

The model pairs accuracy competitive with Whisper large-v3 with a 6.3x speedup, and it is optimized specifically for Japanese. It gets there by keeping the full large-v3 encoder, shrinking the decoder to two layers, and training on Japanese speech data.

Q: What are the recommended use cases?

The model is well suited to Japanese speech recognition, particularly when processing speed matters. It handles both short-form audio (under 30 seconds) and long-form audio, and long-form transcription can run either sequentially or in chunks depending on whether accuracy or speed is the priority.
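The choice between short-form, sequential, and chunked processing can be sketched as a small routing helper. This assumes the usual behavior of Whisper models in transformers, where chunked long-form decoding is the faster option and sequential decoding the more accurate one. The function names, the 15-second chunk length, and the batch size of 16 are illustrative assumptions, not values stated on this page.

```python
# Sketch: choosing a transcription mode and the matching pipeline
# call arguments. Names and default values are illustrative.
from typing import Any, Dict

SHORT_FORM_LIMIT_S = 30.0  # short-form threshold mentioned above


def choose_mode(duration_s: float, prefer_speed: bool) -> str:
    """Return 'short-form', 'chunked', or 'sequential'."""
    if duration_s < SHORT_FORM_LIMIT_S:
        return "short-form"
    return "chunked" if prefer_speed else "sequential"


def call_kwargs(mode: str, chunk_length_s: float = 15.0, batch_size: int = 16) -> Dict[str, Any]:
    """Build keyword arguments for an ASR pipeline call."""
    kwargs: Dict[str, Any] = {
        "generate_kwargs": {"language": "ja", "task": "transcribe"},
    }
    if mode == "chunked":
        # Split the audio into fixed windows and decode them in batches.
        kwargs["chunk_length_s"] = chunk_length_s
        kwargs["batch_size"] = batch_size
    elif mode == "sequential":
        # Long-form sequential decoding relies on predicted timestamps.
        kwargs["return_timestamps"] = True
    return kwargs


mode = choose_mode(duration_s=600.0, prefer_speed=True)
print(mode)  # chunked
print(call_kwargs(mode)["chunk_length_s"])  # 15.0
```

The resulting dictionary would be passed as `pipe(audio, **call_kwargs(mode))` to a transformers ASR pipeline; chunk length and batch size are worth tuning against available GPU memory.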
