kotoba-whisper-v2.0

Maintained By
kotoba-tech

| Property | Value |
|---|---|
| Parameter Count | 756M |
| License | Apache 2.0 |
| Language | Japanese |
| Paper | Link |
| Tensor Type | BF16 |

What is kotoba-whisper-v2.0?

Kotoba-Whisper v2.0 is a specialized Japanese Automatic Speech Recognition (ASR) model developed through collaboration between Asahi Ushio and Kotoba Technologies. It's a distilled version of OpenAI's Whisper large-v3, offering 6.3x faster performance while maintaining comparable accuracy. The model was trained on over 7.2 million audio clips from the ReazonSpeech dataset.

Implementation Details

The model employs a unique architecture combining the full encoder of Whisper large-v3 with a simplified decoder using only two layers. It's optimized for 16kHz audio and supports both short-form and long-form transcription tasks.

  • Achieves better CER and WER than Whisper large-v3 on in-domain tests
  • Supports Flash Attention 2 for improved performance
  • Includes timestamps and chunked processing capabilities
  • Compatible with both sequential and chunked long-form transcription
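The features above can be sketched with the standard Transformers ASR pipeline. This is a minimal, hedged example: the model id comes from this card, but the exact argument choices (dtype selection, the `model_kwargs` route for Flash Attention 2) are common Whisper-pipeline conventions, not instructions from the model authors.

```python
# Minimal sketch of short-form Japanese transcription with kotoba-whisper-v2.0.
# Assumptions: standard Transformers ASR pipeline usage; device/dtype choices
# are illustrative, not official recommendations.
import torch
from transformers import pipeline

MODEL_ID = "kotoba-tech/kotoba-whisper-v2.0"

def build_asr_pipeline(device: str = "cpu", use_flash_attention: bool = False):
    """Build an ASR pipeline for kotoba-whisper-v2.0.

    BF16 matches the published tensor type on GPU; FP32 is a safe CPU fallback.
    Flash Attention 2 (if the flash-attn package is installed) can be requested
    through model_kwargs.
    """
    torch_dtype = torch.bfloat16 if device != "cpu" else torch.float32
    model_kwargs = (
        {"attn_implementation": "flash_attention_2"} if use_flash_attention else {}
    )
    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        torch_dtype=torch_dtype,
        device=device,
        model_kwargs=model_kwargs,
    )

# Example usage (input audio is resampled to the expected 16 kHz by the
# feature extractor):
# asr = build_asr_pipeline("cuda:0")
# result = asr("sample.wav", generate_kwargs={"language": "ja", "task": "transcribe"})
# print(result["text"])
```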

Core Capabilities

  • Fast transcription: 6.3x faster than Whisper large-v3
  • High accuracy: 9.2% CER on CommonVoice 8.0
  • Efficient processing: Supports batch processing and GPU acceleration
  • Flexible deployment: Compatible with Transformers pipeline
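For audio longer than 30 seconds, the chunked long-form algorithm in the Transformers pipeline applies. The sketch below is an assumption-labeled illustration: `chunk_length_s=15` and `batch_size=16` are example values, not tuned recommendations from the model card.

```python
# Sketch of chunked long-form transcription with kotoba-whisper-v2.0.
# chunk_length_s and batch_size are illustrative defaults, not official values.
from transformers import pipeline

MODEL_ID = "kotoba-tech/kotoba-whisper-v2.0"

def transcribe_long_form(audio_path: str,
                         chunk_length_s: int = 15,
                         batch_size: int = 16) -> dict:
    """Transcribe long-form Japanese audio by splitting it into chunks.

    chunk_length_s enables the chunked long-form algorithm; batch_size lets
    the pipeline process chunks in parallel on a GPU. return_timestamps=True
    exposes the per-segment timestamps the card mentions.
    """
    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=chunk_length_s,
        batch_size=batch_size,
        return_timestamps=True,
    )
    return asr(audio_path, generate_kwargs={"language": "ja", "task": "transcribe"})

# Example usage:
# out = transcribe_long_form("meeting.wav")
# print(out["text"])       # full transcript
# print(out["chunks"])     # list of {"timestamp": (start, end), "text": ...}
```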

Frequently Asked Questions

Q: What makes this model unique?

The model combines speed optimization with high accuracy specifically for Japanese ASR, making it particularly efficient for production environments while maintaining competitive error rates.

Q: What are the recommended use cases?

It's ideal for Japanese speech transcription tasks, particularly in production environments where both speed and accuracy are crucial. It can handle both short-form (&lt;30s) and long-form (&gt;30s) audio with various optimization options.
