YushiUeda_swbd_sentiment_asr_train_asr_conformer
| Property | Value |
|---|---|
| Framework | ESPnet v0.10.7a1 |
| Model Type | ASR with Sentiment Analysis |
| Architecture | Conformer encoder + Transformer decoder |
| Token Type | Word-based |
| Model Link | HuggingFace |
What is YushiUeda_swbd_sentiment_asr_train_asr_conformer?
This speech recognition model combines automatic speech recognition (ASR) with sentiment analysis. Built on ESPnet, it uses a Conformer encoder and a Transformer decoder to transcribe speech while simultaneously predicting one of three sentiment labels (Positive, Neutral, Negative). The model was trained on the Switchboard corpus with spectral augmentation.
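As a minimal sketch of how a combined output could be consumed downstream, the helper below separates a sentiment label from the transcript. It assumes the label is emitted as the first word token of the decoded hypothesis, which this description does not confirm. The function name `split_hypothesis` is hypothetical, and the real output format should be checked before relying on it.

```python
# Hypothetical post-processing helper; not part of ESPnet.
SENTIMENT_LABELS = ("Positive", "Neutral", "Negative")


def split_hypothesis(hypothesis: str):
    """Split a decoded word sequence into (sentiment, transcript).

    Assumption: the sentiment label appears as the first word token.
    Returns (None, transcript) when no known label is found there.
    """
    words = hypothesis.split()
    if words and words[0] in SENTIMENT_LABELS:
        return words[0], " ".join(words[1:])
    return None, " ".join(words)


if __name__ == "__main__":
    # Made-up hypothesis string, for illustration only.
    print(split_hypothesis("Positive that sounds great"))
```

If the label turns out to be placed elsewhere, such as at the end of the hypothesis, only the indexing inside the helper would need to change.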
Implementation Details
The encoder is a 12-block Conformer with an output dimension of 512, 4 attention heads, and 2048 linear units. The decoder is a 6-block Transformer with a matching attention configuration. The pipeline also applies spectral augmentation and utterance-level mean-variance normalization. Other notable settings:
- Conformer encoder with CNN modules (kernel size 31)
- Relative positional encoding
- Macaron-style architecture
- Swish activation function
- Joint CTC-Attention training with CTC weight of 0.5
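The settings above can be collected into a small sketch, together with the standard way a joint CTC-attention objective interpolates its two losses. The dictionary keys and the `joint_loss` function are illustrative names, not ESPnet's configuration schema or API, and the loss values in the usage line are made up.

```python
# Hyperparameters as stated in this description; key names are illustrative.
ENCODER = {
    "type": "conformer",
    "num_blocks": 12,
    "output_size": 512,
    "attention_heads": 4,
    "linear_units": 2048,
    "cnn_module_kernel": 31,
    "macaron_style": True,
    "activation": "swish",
    "positional_encoding": "relative",
}
DECODER = {"type": "transformer", "num_blocks": 6, "attention_heads": 4}
CTC_WEIGHT = 0.5


def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float = CTC_WEIGHT) -> float:
    """Interpolate CTC and attention losses: w * L_ctc + (1 - w) * L_att."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # With a weight of 0.5 both branches contribute equally.
    print(joint_loss(ctc_loss=2.0, att_loss=4.0))
```

A weight of 0.5 means the encoder's CTC branch and the attention decoder influence training equally. Moving the weight toward 0 or 1 would favor the attention or the CTC branch respectively.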
Core Capabilities
- Speech recognition with word-level tokenization
- Concurrent sentiment classification
- Reaches ~61% macro F1 on sentiment classification
- Reaches ~65% weighted F1 on the validation set
- Handles conversational speech input
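The two F1 figures above differ in how per-class scores are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by its number of true examples. The sketch below computes both in plain Python on made-up toy labels. It does not reproduce the reported results, and `f1_scores` is an illustrative name.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, weighted_f1) for a multi-class classification."""
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    labels = ["Positive", "Neutral", "Negative"]
    # Toy data: the majority class is predicted well, a rare class is missed.
    y_true = ["Neutral"] * 4 + ["Positive", "Negative"]
    y_pred = ["Neutral"] * 4 + ["Neutral", "Negative"]
    macro, weighted = f1_scores(y_true, y_pred, labels)
    print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

When a frequent class is recognized well and a rare class poorly, weighted F1 comes out higher than macro F1, which is one possible reading of the gap between the two reported scores.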
Frequently Asked Questions
Q: What makes this model unique?
This model performs ASR and sentiment analysis simultaneously within a single Conformer-Transformer architecture. Spectral augmentation and relative positional encoding are intended to make it more robust on conversational speech.
Q: What are the recommended use cases?
The model is best suited for applications that need both speech transcription and sentiment analysis, such as call center analytics, customer feedback processing, or conversation analysis. It is most likely to perform well on English conversational speech similar to the Switchboard corpus.