CrisperWhisper

Maintained by nyrahealth

  • Author: nyrahealth
  • License: CC-BY-NC-4.0
  • Paper Status: Accepted at INTERSPEECH 2024
  • Model Type: Speech Recognition

What is CrisperWhisper?

CrisperWhisper is an enhanced version of OpenAI's Whisper built to capture verbatim speech with high accuracy. It introduces marked improvements in word-level timestamp accuracy and disfluency detection, and it has achieved first place on the OpenASR Leaderboard for verbatim datasets, demonstrating strong performance on real-world recordings.

Implementation Details

The model applies Dynamic Time Warping (DTW) to Whisper's cross-attention scores, combined with a specialized loss function for the alignment heads. It uses a custom tokenizer and an additional attention loss during training to achieve precise timestamp generation. Training proceeds in three stages: initial tokenizer adjustment, training on verbatim datasets, and final attention-loss optimization.

  • Utilizes WavLM augmentations during training for improved robustness
  • Implements a sophisticated pause distribution logic for accurate timing
  • Features 15 carefully selected alignment heads for optimal performance
  • Supports both English and German languages
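To make the timestamp mechanism concrete, below is a minimal, self-contained sketch of the DTW step. It is not the authors' implementation: it assumes you already have an averaged cross-attention matrix of shape (tokens × audio frames) from the selected alignment heads, and it assumes Whisper's roughly 20 ms frame resolution; the DTW path assigns each token a contiguous span of frames, which is then converted to start/end times.

```python
import numpy as np

def dtw_path(cost):
    """Minimal DTW over a (tokens x frames) cost matrix.
    Returns a monotonic path of (token_index, frame_index) pairs."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the alignment path.
    i, j, path = n_tokens, n_frames, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def token_timestamps(attention, frame_duration=0.02):
    """Map each token to a (start, end) time in seconds.
    `attention` is an averaged cross-attention matrix (tokens x frames);
    higher attention is treated as lower alignment cost."""
    path = dtw_path(-attention)
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return [(first * frame_duration, (last + 1) * frame_duration)
            for first, last in (spans[t] for t in sorted(spans))]
```

CrisperWhisper additionally redistributes pause durations between neighbouring words rather than attaching them wholesale to one word; that pause distribution logic is not reproduced in this sketch.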

Core Capabilities

  • Precise word-level timestamp generation with enhanced accuracy
  • Accurate transcription of disfluencies, including fillers like "um" and "uh"
  • Significantly reduced hallucination rates compared to standard Whisper
  • Superior segmentation performance, especially around pauses and disfluencies
  • Average WER of 6.66% across various datasets
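As a quick orientation, a word-level transcription call might look like the sketch below. It assumes the checkpoint is published as nyrahealth/CrisperWhisper on the Hugging Face Hub, that it loads through the standard transformers speech-recognition pipeline, and that meeting.wav is a local audio file; the exact output format can vary slightly between transformers versions.

```python
# Hedged sketch: verbatim transcription with word-level timestamps.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",  # assumed Hub checkpoint id
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,
)

# return_timestamps="word" asks the pipeline for per-word (start, end) pairs.
result = asr("meeting.wav", return_timestamps="word")

print(result["text"])  # verbatim transcript, fillers included
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:6.2f}-{end:6.2f}  {chunk['text']}")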

Frequently Asked Questions

Q: What makes this model unique?

CrisperWhisper's unique strength lies in its ability to capture speech exactly as spoken, including disfluencies, false starts, and precise timing information. Unlike traditional speech recognition models that attempt to "clean up" speech, it maintains verbatim accuracy while providing highly precise timestamp information.

Q: What are the recommended use cases?

The model is particularly well-suited for applications requiring exact transcription, such as academic research, legal documentation, or any scenario where capturing speech patterns and timing is crucial. It excels in situations where disfluencies and precise temporal information are important for analysis.
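For that kind of analysis, the word-level timestamps can be post-processed directly. The sketch below is illustrative only: it assumes the chunks structure returned by the pipeline example above and a hypothetical filler list, and it reports inter-word pauses longer than a threshold together with a simple filler count.

```python
FILLERS = {"um", "uh", "erm", "ähm"}  # illustrative set, not the model's official list

def pause_report(chunks, min_pause=0.25):
    """Summarise pauses and filler words from word-level pipeline output.
    `chunks` is a list of {"text": str, "timestamp": (start, end)} dicts."""
    filler_count = sum(
        1 for c in chunks
        if c["text"].strip().lower().strip(".,?!") in FILLERS
    )
    pauses = []
    for prev, cur in zip(chunks, chunks[1:]):
        gap = cur["timestamp"][0] - prev["timestamp"][1]  # silence between words
        if gap >= min_pause:
            pauses.append((prev["text"], cur["text"], round(gap, 2)))
    return {"pauses": pauses, "filler_count": filler_count}
```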
