faster_CrisperWhisper

Property	Value
Author	nyrahealth
License	cc-by-nc-4.0
Paper	Research Paper
Downloads	367

What is faster_CrisperWhisper?

faster_CrisperWhisper is a version of CrisperWhisper converted for use with the faster-whisper framework, a CTranslate2-based reimplementation of Whisper inference. It is built for verbatim speech recognition: instead of producing a cleaned-up transcript, it aims to transcribe what was actually said, including fillers, stutters, and false starts, and to mark pauses with accurate timing. Conventional speech recognition models often omit or smooth over these elements.

Implementation Details

The model applies Dynamic Time Warping (DTW) to Whisper's cross-attention scores to derive word-level timestamps, and it is trained with a specialized attention loss that improves timestamp accuracy. Training proceeds in three stages and incorporates WavLM augmentations together with dedicated data preparation to make the model robust.
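To make the DTW step concrete, here is a minimal, generic sketch of DTW alignment over a token-by-frame attention matrix in pure Python. It follows the common Whisper convention of negating attention to obtain a cost, so the cheapest path is the one that collects the most attention. The function names, the toy attention matrix, and the 0.02 s per encoder frame (Whisper's 1500 encoder positions per 30 s window) are assumptions for illustration; CrisperWhisper's actual head selection, normalization, and filtering are not reproduced here.

```python
from typing import Dict, List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost monotonic alignment path through a
    (num_tokens x num_frames) cost matrix as (token, frame) index pairs."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest cost of aligning the first i tokens with the first j frames
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # diagonal step: next token and next frame
                acc[i - 1][j],      # vertical step: next token, same frame
                acc[i][j - 1],      # horizontal step: same token, next frame
            )
    # Backtrack from the last (token, frame) cell to the origin
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def token_start_times(attention: List[List[float]],
                      seconds_per_frame: float = 0.02) -> List[float]:
    """Map each token to the time of the first audio frame DTW aligns it to."""
    # High attention means a good match, so negate it to obtain a cost
    cost = [[-a for a in row] for row in attention]
    starts: Dict[int, float] = {}
    for token, frame in dtw_path(cost):
        starts.setdefault(token, frame * seconds_per_frame)
    return [starts[t] for t in range(len(attention))]
```

Because every step moves forward by at most one token and one frame, each token receives a contiguous, non-overlapping span of frames, which is what turns attention scores into monotonic word timestamps.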

Implements custom attention loss for improved timestamp accuracy
Uses DTW-based alignment for precise word-level timestamps
Trained on a mixture of English and German datasets
Incorporates noise augmentation for improved hallucination resistance
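The card does not describe the exact noise augmentation recipe. A common form of noise augmentation is additive mixing of a noise recording into clean speech at a chosen signal-to-noise ratio (SNR); the sketch below shows only that generic technique, not CrisperWhisper's training pipeline. The function name and the use of plain Python lists as audio buffers are assumptions for illustration.

```python
import math
from typing import List


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Return speech + scaled noise so that the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop (or trim) the noise so it covers the whole speech signal
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise must not be silent")
    # Choose scale so that p_speech / (scale**2 * p_noise) == 10 ** (snr_db / 10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

In a training loop one would typically draw a different SNR per example (the range is a design choice, not something the card specifies), so the model sees speech at varying noise levels and is less inclined to invent words in noisy or silent stretches.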

Core Capabilities

Accurate word-level timestamp generation
Verbatim transcription including fillers and disfluencies
Detection and transcription of filler words such as "um" and "uh"
Hallucination mitigation through specialized training
Reported state-of-the-art results on verbatim benchmark datasets (AMI, TED)
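Word-level timestamps and verbatim fillers make simple downstream analysis possible. faster-whisper's word timestamps expose a start time, end time, and text for each word; in the sketch below a small hypothetical `Word` tuple stands in for that output. The filler vocabulary and the 0.3 s pause threshold are illustrative assumptions, and the filler spellings the model actually emits may differ.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical filler vocabulary; adjust to the model's actual filler spelling
FILLERS = {"um", "uh", "äh", "ähm"}


def find_pauses(words: List[Word], min_gap: float = 0.3) -> List[Tuple[float, float]]:
    """Return (start, end) of silences between consecutive words lasting at least min_gap."""
    return [
        (prev.end, cur.start)
        for prev, cur in zip(words, words[1:])
        if cur.start - prev.end >= min_gap
    ]


def find_fillers(words: List[Word]) -> List[Word]:
    """Return the words whose normalized text is in FILLERS."""
    return [w for w in words if w.text.strip(" .,!?[]").lower() in FILLERS]
```

Pause detection here depends entirely on the accuracy of word boundaries, which is why timestamp precision around disfluencies matters for this kind of analysis.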

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to provide verbatim transcriptions with accurate timestamps, particularly around disfluencies and pauses, sets it apart. It achieves this through custom tokenization and specialized attention loss training.

Q: What are the recommended use cases?

This model is ideal for applications requiring exact transcriptions with precise timing, such as subtitle generation, linguistic research, or any scenario where capturing speech patterns, including fillers and disfluencies, is important.
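For subtitle generation, word-level timestamps can be grouped into SubRip (SRT) cues. The sketch below assumes words arrive as hypothetical `(text, start, end)` tuples in seconds and uses a naive fixed word count per cue; production subtitlers usually also limit characters per line and cue duration.

```python
from typing import List, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Tuple[str, float, float]], max_words: int = 7) -> str:
    """Group consecutive words into numbered SRT cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = "".join(w for w, _, _ in chunk).strip()
        # Each cue spans from its first word's start to its last word's end
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_time(chunk[0][1])} --> {srt_time(chunk[-1][2])}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive cues
    return "\n".join(cues)
```

Because fillers are kept as words with their own timestamps, they appear in the cues exactly where they were spoken, which is what verbatim subtitles require.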