# emotion-diarization-wavlm-large
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Framework | PyTorch / SpeechBrain |
| Paper | Speech Emotion Diarization |
| Datasets | 6 (ZaionEmotionDataset, IEMOCAP, RAVDESS, JL-corpus, ESD, EMOV-DB) |
## What is emotion-diarization-wavlm-large?
This is a speech emotion diarization model built on the WavLM-large architecture: it detects emotional segments within a speech recording and locates them in time. The model achieves a 29.7% Emotion Diarization Error Rate (EDER, lower is better) on the ZaionEmotionDataset test set, making it suitable for practical emotion-analysis applications.
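EDER measures how much of the audio's duration is assigned the wrong emotion, combining false alarms, missed emotion, and emotion confusion. As a rough illustration of the idea (not the official metric from the paper), a frame-level error rate over aligned label sequences can be sketched like this; `frame_error_rate` is a hypothetical helper:

```python
def frame_error_rate(reference, hypothesis):
    """Fraction of frames whose emotion label disagrees with the reference.

    Simplified, frame-level stand-in for EDER: the real metric aggregates
    false alarm, missed emotion, and emotion confusion over segment durations.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must be aligned to the same frames")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

ref = ["neutral", "neutral", "happy", "happy", "happy", "sad"]
hyp = ["neutral", "happy",   "happy", "happy", "sad",   "sad"]
print(frame_error_rate(ref, hyp))  # 2 of 6 frames disagree
```

Note that because errors are weighted by duration, a model can score well on short clips yet still misplace boundaries; the segment-duration weighting in the real EDER addresses exactly this.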
## Implementation Details
The system combines a WavLM encoder with a frame-wise classifier that predicts emotion labels and their temporal boundaries in speech recordings. It expects 16 kHz, single-channel audio and applies automatic normalization during input preprocessing.
- Built on SpeechBrain framework for robust speech processing
- Supports GPU inference for faster processing
- Handles multiple emotion categories including neutral, happy, and sad
- Provides temporal boundaries for emotion segments
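Since the classifier operates frame by frame, segment boundaries fall out of merging runs of identical frame labels. A minimal sketch of that post-processing step, assuming a fixed frame hop; the `frames_to_segments` helper and the 20 ms hop are illustrative, not the model's actual internals:

```python
def frames_to_segments(labels, hop_s=0.02):
    """Merge consecutive identical frame labels into (start, end, label) segments.

    hop_s is the assumed time step between frames in seconds (20 ms here).
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at a label change or at the end of input.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((round(start * hop_s, 3), round(i * hop_s, 3), labels[start]))
            start = i
    return segments

frames = ["neutral"] * 50 + ["happy"] * 100 + ["neutral"] * 25
print(frames_to_segments(frames))
# [(0.0, 1.0, 'neutral'), (1.0, 3.0, 'happy'), (3.0, 3.5, 'neutral')]
```

This is also why the frame rate of the encoder bounds the temporal precision of the reported boundaries.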
## Core Capabilities
- Automatic emotion boundary detection in continuous speech
- Multi-emotion classification within single audio files
- Temporal segmentation with precise start and end times
- Processing of various audio formats with automatic normalization
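The preprocessing the capabilities above rely on (downmix to a single channel, normalize amplitude) can be sketched as follows. The exact normalization the model applies internally may differ; the `preprocess` helper below is illustrative only, and resampling to 16 kHz is omitted:

```python
def preprocess(samples):
    """Downmix multi-channel audio to mono and peak-normalize to [-1, 1].

    `samples` is a list of per-channel sample lists (e.g. stereo: two lists).
    Resampling to 16 kHz, which the model also requires, is not shown here.
    """
    n = len(samples[0])
    # Average channels sample by sample to get a mono signal.
    mono = [sum(ch[i] for ch in samples) / len(samples) for i in range(n)]
    peak = max(abs(x) for x in mono)
    # Scale so the loudest sample sits at +/-1.0; leave silence untouched.
    return [x / peak for x in mono] if peak > 0 else mono

stereo = [[0.2, -0.4, 0.1], [0.2, -0.4, 0.3]]
print(preprocess(stereo))  # → [0.5, -1.0, 0.5]
```

In practice a library resampler (e.g. torchaudio) would handle the sample-rate conversion before a step like this.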
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its ability not just to classify emotions but also to identify their temporal boundaries within speech. It was trained on six diverse emotional datasets and achieves strong performance, with a 29.7% EDER on the ZaionEmotionDataset test set.
**Q: What are the recommended use cases?**
The model is ideal for applications requiring detailed emotional analysis of speech, such as call center monitoring, mental health applications, or research in affective computing. It's particularly useful when temporal information about emotional changes is needed.