# emotion-diarization-wavlm-large
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Framework | PyTorch / SpeechBrain |
| Paper | Speech Emotion Diarization |
| Datasets | 6 (ZaionEmotionDataset, IEMOCAP, RAVDESS, JL-corpus, ESD, EMOV-DB) |
## What is emotion-diarization-wavlm-large?
This is a speech emotion diarization model built on the WavLM-large architecture: it detects emotional segments within a speech recording and locates them in time. The model achieves a 29.7% Emotion Diarization Error Rate (EDER, lower is better) on the ZaionEmotionDataset test set, making it suitable for practical emotion-analysis applications.
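EDER measures how much of the audio's duration is assigned the wrong emotion, combining false alarms, missed emotion, and emotion confusion. As a rough illustration of the idea (not the official metric from the paper), a frame-level error rate over aligned label sequences can be sketched like this; `frame_error_rate` is a hypothetical helper:

```python
def frame_error_rate(reference, hypothesis):
    """Fraction of frames whose emotion label disagrees with the reference.

    Simplified, frame-level stand-in for EDER: the real metric aggregates
    false alarm, missed emotion, and emotion confusion over segment durations.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must be aligned to the same frames")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

ref = ["neutral", "neutral", "happy", "happy", "happy", "sad"]
hyp = ["neutral", "happy",   "happy", "happy", "sad",   "sad"]
print(frame_error_rate(ref, hyp))  # 2 of 6 frames disagree
```

Note that because errors are weighted by duration, a model can score well on short clips yet still misplace boundaries; the segment-duration weighting in the real EDER addresses exactly this.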
## Implementation Details
The system combines a WavLM encoder with a frame-wise classifier that predicts emotion labels and their temporal boundaries in speech recordings. It expects 16 kHz, single-channel audio and applies automatic normalization during input preprocessing.
- Built on SpeechBrain framework for robust speech processing
- Supports GPU inference for faster processing
- Handles multiple emotion categories including neutral, happy, and sad
- Provides temporal boundaries for emotion segments
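Since the classifier operates frame by frame, segment boundaries fall out of merging runs of identical frame labels. A minimal sketch of that post-processing step, assuming a fixed frame hop; the `frames_to_segments` helper and the 20 ms hop are illustrative, not the model's actual internals:

```python
def frames_to_segments(labels, hop_s=0.02):
    """Merge consecutive identical frame labels into (start, end, label) segments.

    hop_s is the assumed time step between frames in seconds (20 ms here).
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at a label change or at the end of input.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((round(start * hop_s, 3), round(i * hop_s, 3), labels[start]))
            start = i
    return segments

frames = ["neutral"] * 50 + ["happy"] * 100 + ["neutral"] * 25
print(frames_to_segments(frames))
# [(0.0, 1.0, 'neutral'), (1.0, 3.0, 'happy'), (3.0, 3.5, 'neutral')]
```

This is also why the frame rate of the encoder bounds the temporal precision of the reported boundaries.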
## Core Capabilities
- Automatic emotion boundary detection in continuous speech
- Multi-emotion classification within single audio files
- Temporal segmentation with precise start and end times
- Processing of various audio formats with automatic normalization
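The preprocessing the capabilities above rely on (downmix to a single channel, normalize amplitude) can be sketched as follows. The exact normalization the model applies internally may differ; the `preprocess` helper below is illustrative only, and resampling to 16 kHz is omitted:

```python
def preprocess(samples):
    """Downmix multi-channel audio to mono and peak-normalize to [-1, 1].

    `samples` is a list of per-channel sample lists (e.g. stereo: two lists).
    Resampling to 16 kHz, which the model also requires, is not shown here.
    """
    n = len(samples[0])
    # Average channels sample by sample to get a mono signal.
    mono = [sum(ch[i] for ch in samples) / len(samples) for i in range(n)]
    peak = max(abs(x) for x in mono)
    # Scale so the loudest sample sits at +/-1.0; leave silence untouched.
    return [x / peak for x in mono] if peak > 0 else mono

stereo = [[0.2, -0.4, 0.1], [0.2, -0.4, 0.3]]
print(preprocess(stereo))  # → [0.5, -1.0, 0.5]
```

In practice a library resampler (e.g. torchaudio) would handle the sample-rate conversion before a step like this.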
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its ability not just to classify emotions but also to identify their temporal boundaries within speech. It was trained on six diverse emotional datasets and achieves strong performance, with a 29.7% EDER on the ZaionEmotionDataset test set.
**Q: What are the recommended use cases?**
The model is ideal for applications requiring detailed emotional analysis of speech, such as call center monitoring, mental health applications, or research in affective computing. It's particularly useful when temporal information about emotional changes is needed.