speech-emotion-recognition-with-openai-whisper-large-v3

by firdhokk

Speech emotion recognition model based on Whisper Large V3, achieving 91.99% accuracy across 7 emotional states. Trained on the RAVDESS, SAVEE, TESS, and URDU datasets.

  • Base Model: OpenAI Whisper Large V3
  • Task: Speech Emotion Recognition
  • Accuracy: 91.99%
  • Number of Emotions: 7 (Angry, Disgust, Fearful, Happy, Neutral, Sad, Surprised)
  • Model Link: HuggingFace

What is speech-emotion-recognition-with-openai-whisper-large-v3?

This is a specialized emotion recognition model that builds on OpenAI's Whisper Large V3 architecture to detect emotions in speech. The model was fine-tuned on a combined dataset of RAVDESS, SAVEE, TESS, and URDU speech samples, reaching 91.99% accuracy across seven distinct emotional states.
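To make the classification step concrete: the model produces one score per emotion class, and the predicted label is the arg-max over those scores. The label order below is an assumption for illustration only; the authoritative mapping lives in the model's `id2label` config on HuggingFace.

```python
# Hypothetical label order for illustration; consult the model's
# id2label config for the authoritative mapping.
EMOTIONS = ["angry", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]

def decode(scores):
    """Map a list of per-class scores to the highest-scoring emotion label."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return EMOTIONS[best]

print(decode([0.1, 0.0, 0.2, 3.1, 0.4, 0.2, 0.05]))  # prints "happy"
```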

Implementation Details

The model was fine-tuned with carefully tuned hyperparameters, including a learning rate of 5e-05, gradient accumulation over 5 steps, and mixed-precision training. Audio preprocessing is handled through Librosa, with the Whisper feature extractor standardizing inputs for consistent analysis.

  • Training conducted over 25 epochs with effective batch size of 10
  • Implements Adam optimizer with specialized beta parameters
  • Features linear learning rate scheduling with 0.1 warmup ratio
  • Achieves 92.30% precision and 91.99% recall
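
The hyperparameters above can be gathered into a single sketch. The per-device batch size of 2 is an assumption: the card only reports gradient accumulation over 5 steps and an effective batch size of 10, so 2 × 5 is one configuration consistent with both.

```python
# Reported fine-tuning hyperparameters; per_device_batch_size is an
# assumption chosen so that 2 * 5 matches the stated effective size of 10.
config = {
    "learning_rate": 5e-05,
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 5,
    "num_epochs": 25,
    "warmup_ratio": 0.1,
    "lr_scheduler": "linear",
    "optimizer": "adam",
    "mixed_precision": True,
}

effective_batch = (
    config["per_device_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # prints 10
```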

Core Capabilities

  • Real-time emotion classification from speech input
  • Support for 7 distinct emotional states
  • Handles variable-length audio inputs up to 30 seconds
  • GPU-accelerated inference with CUDA support
  • Simple integration through HuggingFace Transformers library
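
A minimal sketch of the 30-second input handling, assuming the standard Whisper convention of 16 kHz mono audio padded or truncated to a fixed window. This helper is illustrative, not the model's actual preprocessing code (which runs through the Whisper feature extractor).

```python
import numpy as np

SAMPLE_RATE = 16_000           # Whisper models expect 16 kHz audio
MAX_SECONDS = 30               # inputs beyond 30 s are truncated
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS

def pad_or_truncate(audio: np.ndarray) -> np.ndarray:
    """Fit a mono waveform to the fixed 30-second window Whisper expects."""
    if len(audio) > MAX_SAMPLES:
        return audio[:MAX_SAMPLES]
    return np.pad(audio, (0, MAX_SAMPLES - len(audio)))

short = pad_or_truncate(np.ones(SAMPLE_RATE * 5))    # 5 s clip, zero-padded
long_ = pad_or_truncate(np.ones(SAMPLE_RATE * 45))   # 45 s clip, truncated
print(short.shape, long_.shape)  # prints (480000,) (480000,)
```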

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its high emotion recognition accuracy (91.99%) and its foundation on the powerful Whisper Large V3 architecture. The balanced training dataset and careful hyperparameter tuning make it particularly robust for real-world applications.

Q: What are the recommended use cases?

The model is ideal for applications in sentiment analysis, customer service automation, mental health monitoring, and human-computer interaction where understanding emotional context is crucial. It's particularly suited for scenarios requiring real-time emotion detection from speech.
