CrisperWhisper
Property | Value |
---|---|
Author | nyrahealth |
License | CC-BY-NC-4.0 |
Paper Status | Accepted at INTERSPEECH 2024 |
Model Type | Speech Recognition |
What is CrisperWhisper?
CrisperWhisper represents a significant advancement in speech recognition technology, specifically designed to capture verbatim speech with exceptional accuracy. As an enhanced version of OpenAI's Whisper, it introduces crucial improvements in word-level timestamp accuracy and disfluency detection. The model has achieved first place on the OpenASR Leaderboard for verbatim datasets, demonstrating its superior performance in real-world applications.
Implementation Details
The model employs Dynamic Time Warping (DTW) on Whisper cross-attention scores, combined with a specialized loss function for alignment heads. It uses a custom tokenizer and implements a unique attention loss during training to achieve precise timestamp generation. The training process involves three stages, including initial tokenizer adjustment, verbatim dataset training, and final attention loss optimization.
- Utilizes WavLM augmentations during training for improved robustness
- Implements a sophisticated pause distribution logic for accurate timing
- Features 15 carefully selected alignment heads for optimal performance
- Supports both English and German languages
Core Capabilities
- Precise word-level timestamp generation with enhanced accuracy
- Accurate transcription of disfluencies, including fillers like "um" and "uh"
- Significantly reduced hallucination rates compared to standard Whisper
- Superior segmentation performance, especially around pauses and disfluencies
- Average WER of 6.66% across various datasets
Frequently Asked Questions
Q: What makes this model unique?
CrisperWhisper's unique strength lies in its ability to capture speech exactly as spoken, including disfluencies, false starts, and precise timing information. Unlike traditional speech recognition models that attempt to "clean up" speech, it maintains verbatim accuracy while providing highly precise timestamp information.
Q: What are the recommended use cases?
The model is particularly well-suited for applications requiring exact transcription, such as academic research, legal documentation, or any scenario where capturing speech patterns and timing is crucial. It excels in situations where disfluencies and precise temporal information are important for analysis.