wav2vec2-large-robust-12-ft-emotion-msp-dim
| Property | Value |
|---|---|
| Parameter Count | 165M parameters |
| License | CC BY-NC-SA 4.0 |
| Paper | Research Paper |
| Framework | PyTorch |
| Dataset | MSP-Podcast v1.7 |
What is wav2vec2-large-robust-12-ft-emotion-msp-dim?
This is a speech emotion recognition model based on the Wav2vec 2.0 architecture and fine-tuned for dimensional emotion recognition. The encoder was pruned from 24 to 12 transformer layers and fine-tuned on the MSP-Podcast (v1.7) dataset to predict emotional characteristics of speech.
Implementation Details
The model takes raw audio at a 16 kHz sampling rate and outputs predictions for three emotional dimensions: arousal, dominance, and valence, each in an approximate range of 0 to 1. It is built on the Wav2Vec2-Large-Robust architecture with a regression head on top for emotion prediction.
- Processes raw audio input through a Wav2Vec2 backbone
- Uses a custom regression head for dimensional emotion prediction
- Outputs both embeddings and emotional dimension scores
- Implements efficient pruning (12 of the original 24 transformer layers); a loading sketch follows this list
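The sketch below shows one way to assemble such a model with the transformers library: a Wav2Vec2 backbone whose hidden states are mean-pooled over time and passed through a small regression head with three outputs. The class and attribute names (`EmotionModel`, `RegressionHead`, `classifier`) mirror the wrapper published on the model's card, but treat them as illustrative; the module names have to match the released checkpoint for `from_pretrained` to load the head weights.

```python
import torch
import torch.nn as nn
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)


class RegressionHead(nn.Module):
    """Pooled-output head mapping a clip embedding to arousal/dominance/valence."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)  # num_labels = 3

    def forward(self, features):
        x = self.dropout(features)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)


class EmotionModel(Wav2Vec2PreTrainedModel):
    """Wav2Vec2 backbone plus regression head; returns embeddings and A/D/V scores."""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(self, input_values):
        outputs = self.wav2vec2(input_values)
        # Mean-pool the last hidden state over time: one embedding per clip.
        hidden_states = outputs[0].mean(dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits
```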
Core Capabilities
- Dimensional emotion recognition from speech
- Feature extraction via time-pooled hidden states (see the usage sketch after this list)
- Near-real-time audio processing on suitable hardware, thanks to the pruned encoder
- Research-focused emotional analysis
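As a usage sketch (assuming the `EmotionModel` wrapper above and assuming the checkpoint is hosted on the Hugging Face Hub as `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim`; the hub identifier is not stated on this page), inference reduces to running a 16 kHz signal through the processor and the model:

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor

# Hub identifier assumed; adjust if the checkpoint lives elsewhere.
model_name = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
# If the repo ships only a feature extractor, use Wav2Vec2FeatureExtractor instead.
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = EmotionModel.from_pretrained(model_name)  # wrapper from the sketch above
model.eval()

# One second of silence as a stand-in for real 16 kHz mono speech.
sampling_rate = 16000
signal = np.zeros(sampling_rate, dtype=np.float32)

inputs = processor(signal, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    embeddings, scores = model(inputs.input_values)

# `embeddings` holds the pooled hidden state (one vector per clip);
# `scores` holds arousal, dominance, and valence, roughly in [0, 1].
arousal, dominance, valence = scores[0].tolist()
print(f"arousal={arousal:.3f}  dominance={dominance:.3f}  valence={valence:.3f}")
```

Both the clip-level embedding and the dimensional scores come from a single forward pass, which is what lets the model serve either as an emotion predictor or as a feature extractor.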
Frequently Asked Questions
Q: What makes this model unique?
This model uniquely combines the robust speech processing capabilities of Wav2vec 2.0 with dimensional emotion recognition, offering a more nuanced approach to emotion analysis compared to categorical models. Its pruned architecture maintains performance while reducing computational requirements.
Q: What are the recommended use cases?
The model is specifically designed for research purposes in speech emotion recognition. It's particularly useful for applications requiring continuous emotional dimension analysis, such as psychological research, human-computer interaction studies, and speech analysis research.