
Maintained by: firdhokk

Speech Emotion Recognition with Whisper Large V3

  • Base Model: OpenAI Whisper Large V3
  • Task: Speech Emotion Recognition
  • Accuracy: 91.99%
  • Number of Emotions: 7 (Angry, Disgust, Fearful, Happy, Neutral, Sad, Surprised)
  • Model Link: HuggingFace

What is speech-emotion-recognition-with-openai-whisper-large-v3?

This is a specialized emotion recognition model that builds on OpenAI's Whisper Large V3 architecture to detect emotions in speech. It was fine-tuned on a combined dataset of RAVDESS, SAVEE, TESS, and URDU speech samples, and it achieves 91.99% accuracy across seven distinct emotional states, making it a strong option for automated emotion detection from audio.

Implementation Details

The model was trained with carefully tuned hyperparameters, including a learning rate of 5e-05, gradient accumulation over 5 steps, and mixed-precision training. Audio preprocessing is handled with Librosa, and the Whisper feature extractor standardizes inputs for consistent analysis.
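
A minimal preprocessing sketch along these lines, assuming the standard openai/whisper-large-v3 feature extractor settings (16 kHz audio, 30-second windows); the file name is a placeholder:

```python
import librosa
from transformers import WhisperFeatureExtractor

# Load the feature extractor that ships with Whisper Large V3
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")

# Load a clip and resample it to Whisper's expected 16 kHz sample rate
audio, sr = librosa.load("speech_sample.wav", sr=16000)  # placeholder file

# Pad/truncate to the 30-second window and compute log-Mel spectrogram features
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
print(inputs["input_features"].shape)  # (1, 128, 3000) for large-v3
```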

  • Training conducted over 25 epochs with an effective batch size of 10 (see the configuration sketch after this list)
  • Uses the Adam optimizer with specialized beta parameters
  • Applies linear learning rate scheduling with a 0.1 warmup ratio
  • Achieves 92.30% precision and 91.99% recall
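
A hedged sketch of that configuration using transformers.TrainingArguments; the per-device batch size split, the Adam beta values, and the output path are assumptions, since the card does not state them exactly:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; values flagged as assumptions
# are not stated exactly in this card.
training_args = TrainingArguments(
    output_dir="whisper-emotion",    # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=2,   # assumption: 2 x 5 accumulation = effective 10
    gradient_accumulation_steps=5,
    num_train_epochs=25,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    fp16=True,                       # mixed-precision training
    adam_beta1=0.9,                  # assumption: exact betas not stated
    adam_beta2=0.999,
)
```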

Core Capabilities

  • Real-time emotion classification from speech input
  • Support for 7 distinct emotional states
  • Handles variable-length audio inputs up to 30 seconds
  • GPU-accelerated inference with CUDA support
  • Simple integration through the HuggingFace Transformers library (see the inference sketch below)
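
As a minimal integration sketch, the model can be loaded through the Transformers audio-classification pipeline; the repo id is inferred from the maintainer and model name, and the audio file is a placeholder:

```python
import torch
from transformers import pipeline

# Use the GPU when CUDA is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1

classifier = pipeline(
    "audio-classification",
    model="firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3",
    device=device,
)

# Classify a local clip (placeholder path); clips up to ~30 s are supported
for prediction in classifier("speech_sample.wav"):
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```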

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its high emotion recognition accuracy (91.99%) and its foundation on the powerful Whisper Large V3 architecture. The balanced training dataset and careful hyperparameter tuning make it particularly robust for real-world applications.

Q: What are the recommended use cases?

The model is ideal for applications in sentiment analysis, customer service automation, mental health monitoring, and human-computer interaction where understanding emotional context is crucial. It's particularly suited for scenarios requiring real-time emotion detection from speech.
