vad-crdnn-libriparty

speechbrain

Voice Activity Detection model using CRDNN architecture, trained on LibriParty dataset. Achieves 94.77% F-Score for speech/non-speech detection at 16kHz.

Property      Value
Author        SpeechBrain
Model Type    Voice Activity Detection (VAD)
Architecture  CRDNN (Convolutional Recurrent Deep Neural Network)
Performance   94.77% F-Score on LibriParty test set
Model Link    HuggingFace

What is vad-crdnn-libriparty?

vad-crdnn-libriparty is a Voice Activity Detection model developed by SpeechBrain that uses a Convolutional Recurrent Deep Neural Network architecture. It identifies speech segments within audio recordings and expects 16 kHz single-channel input. The model outputs timestamps for speech and non-speech segments and reaches 95.18% precision and 94.37% recall on the LibriParty test set; the harmonic mean of these two figures is the reported 94.77% F-Score.
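As a quick consistency check, the reported F-Score can be reproduced from the precision and recall figures. The sketch below assumes the F-Score is the standard F1 (harmonic mean of precision and recall); the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the F1 score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) reported on the LibriParty test set.
f1 = f_score(95.18, 94.37)
print(f"{f1:.2f}")  # 94.77
```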

Implementation Details

The model detects speech through a multi-stage pipeline:

  • Frame-level posterior probability computation using CRDNN
  • Threshold-based speech segment detection
  • Energy-based VAD refinement (optional)
  • Merging of speech segments separated by short gaps
  • Removal of short segments to reduce noise-induced detections
  • Double-check of the remaining speech segments
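The thresholding, merging, and short-segment removal stages can be illustrated with a minimal pure-Python sketch. This is not SpeechBrain's implementation: the function names, the 100 frames-per-second rate, the 0.5 probability threshold, and the 0.25 s merge and length thresholds are all assumptions chosen for illustration, and the energy-based refinement and double-check stages are left out.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

FRAMES_PER_SEC = 100  # assumed rate of the frame-level posteriors


def threshold_to_segments(probs: List[float], th: float = 0.5) -> List[Segment]:
    """Turn frame-level speech probabilities into (start, end) segments."""
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        if p >= th and start is None:
            start = i  # speech begins at this frame
        elif p < th and start is not None:
            segments.append((start / FRAMES_PER_SEC, i / FRAMES_PER_SEC))
            start = None
    if start is not None:  # speech runs to the end of the signal
        segments.append((start / FRAMES_PER_SEC, len(probs) / FRAMES_PER_SEC))
    return segments


def merge_close(segments: List[Segment], close_th: float = 0.25) -> List[Segment]:
    """Merge consecutive segments whose gap is at most close_th seconds."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= close_th:
            merged[-1] = (merged[-1][0], end)  # extend the previous segment
        else:
            merged.append((start, end))
    return merged


def remove_short(segments: List[Segment], len_th: float = 0.25) -> List[Segment]:
    """Drop segments shorter than len_th seconds."""
    return [(s, e) for s, e in segments if e - s >= len_th]


# Toy posteriors: two speech bursts 0.1 s apart, then a 0.05 s blip.
probs = [0.9] * 40 + [0.1] * 10 + [0.9] * 30 + [0.1] * 100 + [0.9] * 5 + [0.1] * 15
raw = threshold_to_segments(probs)
final = remove_short(merge_close(raw))
print(raw)    # [(0.0, 0.4), (0.5, 0.8), (1.8, 1.85)]
print(final)  # [(0.0, 0.8)]
```

In this toy input the first two bursts are merged because the gap between them is below the merge threshold, and the trailing blip is discarded as too short.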

Core Capabilities

  • Processes both short and long audio recordings
  • Outputs precise timing for speech/non-speech segments
  • Supports GPU inference for faster processing
  • Provides visualization tools for VAD output
  • Offers flexible post-processing options

Frequently Asked Questions

Q: What makes this model unique?

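This model pairs a CRDNN classifier with a chain of post-processing steps: thresholding, optional energy-based refinement, merging of close segments, removal of short segments, and a final double-check. The post-processing options are flexible, so speech detection can be tuned to different recording scenarios while keeping the accuracy reported on LibriParty as a reference point.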

Q: What are the recommended use cases?

The model suits applications that need speech segment boundaries in audio recordings, such as automatic transcription systems, audio preprocessing, or speech analysis tools. It expects 16 kHz single-channel audio and is particularly effective in controlled environments.
