anime-whisper

litagin

Japanese speech recognition model specialized for anime and game voices, fine-tuned on over 5,300 hours of data. It has 756M parameters and achieves a 13% character error rate (CER) on anime-domain speech.

Property	Value
Parameter Count	756M parameters
Model Type	Automatic Speech Recognition
License	MIT
Base Model	kotoba-tech/kotoba-whisper-v2.0

What is anime-whisper?

anime-whisper is a Japanese speech recognition model built for anime and game voice acting. It was fine-tuned on over 5,300 hours of anime-style voice data (3.7 million audio files), and it transcribes the emotional, expressive speech typical of anime content more accurately than general-purpose models.
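As a rough sanity check on the dataset figures, dividing the total duration by the file count gives the average clip length. Both totals are rounded, so this is only an estimate. It suggests the corpus consists of short, line-length voice clips:

```latex
\bar{t} \approx \frac{5300\ \text{h} \times 3600\ \text{s/h}}{3.7 \times 10^{6}\ \text{files}}
        = \frac{1.908 \times 10^{7}\ \text{s}}{3.7 \times 10^{6}\ \text{files}}
        \approx 5.2\ \text{s per file}
```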

Implementation Details

The model was fine-tuned from kotoba-whisper-v2.0 in two phases: first only the decoder was trained while the encoder was frozen, then the entire model was fine-tuned. Training took approximately 11.2 days on an H100 NVL GPU.
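The two-phase schedule can be illustrated with a minimal PyTorch sketch. It is not the author's training code. ToyEncoderDecoder is a hypothetical two-layer stand-in for a Whisper-style encoder-decoder, and configure_phase, make_optimizer, train_step and the learning rate are illustrative names and values, since the card gives no hyperparameters. With a real Hugging Face WhisperForConditionalGeneration checkpoint, the module to freeze would be the encoder returned by model.get_encoder().

```python
import torch
from torch import nn


class ToyEncoderDecoder(nn.Module):
    """Hypothetical stand-in for an encoder-decoder ASR model such as Whisper."""

    def __init__(self, dim: int = 8, vocab: int = 16) -> None:
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # stands in for the audio encoder
        self.decoder = nn.Linear(dim, vocab)  # stands in for the text decoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(x)))


def configure_phase(model: ToyEncoderDecoder, phase: int) -> None:
    """Phase 1: encoder frozen, decoder trained. Phase 2: everything trained."""
    if phase not in (1, 2):
        raise ValueError(f"unknown phase: {phase}")
    for p in model.encoder.parameters():
        p.requires_grad = phase == 2
    for p in model.decoder.parameters():
        p.requires_grad = True


def make_optimizer(model: nn.Module, lr: float) -> torch.optim.Optimizer:
    # Hand only the currently trainable parameters to the optimizer.
    return torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer is rebuilt at the phase boundary so that the newly unfrozen encoder weights are actually included in the updates; reusing the phase-1 optimizer would leave them untouched.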

Achieves a 13% character error rate (CER) on anime-domain test data
Handles non-verbal expressions like laughs, sighs, and stutters
Appropriate punctuation placement based on speech rhythm
High accuracy for emotional and expressive speech
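CER is the character-level edit distance between the model's transcript and the reference, divided by the number of reference characters. The following is a minimal sketch in plain Python. The text normalization behind the reported 13% figure (for example, whether punctuation is stripped before scoring) is not stated here, so this version scores the raw strings.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character substitutions, insertions and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if the characters match)
            )
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edit distance divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return edits / total_chars
```

For example, the hypothesis こんにちわ scored against the reference こんにちは has one substitution over five characters, a CER of 0.2 (20%).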

Core Capabilities

Accurate transcription of Japanese anime-style voice acting
Faithful reproduction of non-verbal utterances
Natural punctuation placement
Reduced hallucination compared to general-purpose models
Compact size: 756M parameters, the same as the kotoba-whisper-v2.0 base model
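The parameter count translates directly into a rough memory estimate: parameter count times bytes per parameter. This is a back-of-the-envelope sketch that counts the weights only. It ignores activations, the decoding cache and framework overhead, so real usage will be higher. The BYTES_PER_PARAM table and the weight_memory_gb helper are illustrative, not part of any library.

```python
# Storage size of one parameter for common inference dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    try:
        return num_params * BYTES_PER_PARAM[dtype] / 1e9
    except KeyError:
        raise ValueError(f"unsupported dtype: {dtype}") from None
```

For the rounded figure of 756M parameters, this gives about 1.5 GB in float16 and about 3.0 GB in float32.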

Frequently Asked Questions

Q: What makes this model unique?

The model excels at anime-style speech patterns, emotional expressions, and non-verbal utterances that general-purpose models typically struggle with. It maintains high accuracy while staying relatively lightweight: at 756M parameters, it is about half the size of Whisper large-v3 (1.55B parameters).

Q: What are the recommended use cases?

This model is well suited to transcribing anime, visual novels, and game voice acting, and it is particularly effective on emotional delivery and non-standard speech patterns. However, it should be run without an initial prompt, because prompting can degrade transcription quality.
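A minimal inference sketch with the Hugging Face transformers ASR pipeline follows. It assumes the checkpoint is published on the Hub as litagin/anime-whisper (the author and model name from this card). In transformers, Whisper prompting is done by passing prompt_ids to generation; this sketch never sets it and refuses a prompt outright. The helper names and the fixed language="ja" setting are illustrative choices, not settings taken from the card. Decoding audio from a file path also requires ffmpeg to be installed.

```python
from functools import lru_cache
from typing import Any, Optional

MODEL_ID = "litagin/anime-whisper"  # assumed Hub repository id


def build_generate_kwargs(initial_prompt: Optional[str] = None) -> dict[str, Any]:
    """Decoding options for transcription; deliberately rejects an initial prompt."""
    if initial_prompt:
        raise ValueError("anime-whisper should be run without an initial prompt")
    # Fix the language to Japanese instead of relying on language detection.
    return {"language": "ja", "task": "transcribe"}


@lru_cache(maxsize=1)
def get_asr():
    # Imported lazily so build_generate_kwargs() works without transformers installed;
    # cached so the model is loaded only once.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=MODEL_ID)


def transcribe(audio_path: str) -> str:
    result = get_asr()(audio_path, generate_kwargs=build_generate_kwargs())
    return result["text"]
```

Usage: transcribe("line_001.wav") returns the Japanese transcript of one voice clip (the file name is hypothetical).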