CrisperWhisper

Maintained by nyrahealth

  • Author: nyrahealth
  • License: CC-BY-NC-4.0
  • Paper Status: Accepted at INTERSPEECH 2024
  • Model Type: Speech Recognition

What is CrisperWhisper?

CrisperWhisper is an enhanced version of OpenAI's Whisper built to capture verbatim speech with high accuracy. It introduces marked improvements in word-level timestamp accuracy and disfluency detection, and it has achieved first place on the OpenASR Leaderboard for verbatim datasets, demonstrating strong performance on real-world recordings.

Implementation Details

The model applies Dynamic Time Warping (DTW) to Whisper's cross-attention scores, combined with a specialized loss function for the alignment heads. It uses a custom tokenizer and an additional attention loss during training to achieve precise timestamp generation. Training proceeds in three stages: initial tokenizer adjustment, training on verbatim datasets, and final attention-loss optimization.

  • Utilizes WavLM augmentations during training for improved robustness
  • Implements a sophisticated pause distribution logic for accurate timing
  • Features 15 carefully selected alignment heads for optimal performance
  • Supports both English and German languages
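To make the timestamp mechanism concrete, below is a minimal, self-contained sketch of the DTW step. It is not the authors' implementation: it assumes you already have an averaged cross-attention matrix of shape (tokens × audio frames) from the selected alignment heads, and it assumes Whisper's roughly 20 ms frame resolution; the DTW path assigns each token a contiguous span of frames, which is then converted to start/end times.

```python
import numpy as np

def dtw_path(cost):
    """Minimal DTW over a (tokens x frames) cost matrix.
    Returns a monotonic path of (token_index, frame_index) pairs."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the alignment path.
    i, j, path = n_tokens, n_frames, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def token_timestamps(attention, frame_duration=0.02):
    """Map each token to a (start, end) time in seconds.
    `attention` is an averaged cross-attention matrix (tokens x frames);
    higher attention is treated as lower alignment cost."""
    path = dtw_path(-attention)
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return [(first * frame_duration, (last + 1) * frame_duration)
            for first, last in (spans[t] for t in sorted(spans))]
```

CrisperWhisper additionally redistributes pause durations between neighbouring words rather than attaching them wholesale to one word; that pause distribution logic is not reproduced in this sketch.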

Core Capabilities

  • Precise word-level timestamp generation with enhanced accuracy
  • Accurate transcription of disfluencies, including fillers like "um" and "uh"
  • Significantly reduced hallucination rates compared to standard Whisper
  • Superior segmentation performance, especially around pauses and disfluencies
  • Average WER of 6.66% across various datasets
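As a quick orientation, a word-level transcription call might look like the sketch below. It assumes the checkpoint is published as nyrahealth/CrisperWhisper on the Hugging Face Hub, that it loads through the standard transformers speech-recognition pipeline, and that meeting.wav is a local audio file; the exact output format can vary slightly between transformers versions.

```python
# Hedged sketch: verbatim transcription with word-level timestamps.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",  # assumed Hub checkpoint id
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,
)

# return_timestamps="word" asks the pipeline for per-word (start, end) pairs.
result = asr("meeting.wav", return_timestamps="word")

print(result["text"])  # verbatim transcript, fillers included
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:6.2f}-{end:6.2f}  {chunk['text']}")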

Frequently Asked Questions

Q: What makes this model unique?

CrisperWhisper's unique strength lies in its ability to capture speech exactly as spoken, including disfluencies, false starts, and precise timing information. Unlike traditional speech recognition models that attempt to "clean up" speech, it maintains verbatim accuracy while providing highly precise timestamp information.

Q: What are the recommended use cases?

The model is particularly well-suited for applications requiring exact transcription, such as academic research, legal documentation, or any scenario where capturing speech patterns and timing is crucial. It excels in situations where disfluencies and precise temporal information are important for analysis.
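For that kind of analysis, the word-level timestamps can be post-processed directly. The sketch below is illustrative only: it assumes the chunks structure returned by the pipeline example above and a hypothetical filler list, and it reports inter-word pauses longer than a threshold together with a simple filler count.

```python
FILLERS = {"um", "uh", "erm", "ähm"}  # illustrative set, not the model's official list

def pause_report(chunks, min_pause=0.25):
    """Summarise pauses and filler words from word-level pipeline output.
    `chunks` is a list of {"text": str, "timestamp": (start, end)} dicts."""
    filler_count = sum(
        1 for c in chunks
        if c["text"].strip().lower().strip(".,?!") in FILLERS
    )
    pauses = []
    for prev, cur in zip(chunks, chunks[1:]):
        gap = cur["timestamp"][0] - prev["timestamp"][1]  # silence between words
        if gap >= min_pause:
            pauses.append((prev["text"], cur["text"], round(gap, 2)))
    return {"pauses": pauses, "filler_count": filler_count}
```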
