YushiUeda_swbd_sentiment_asr_train_asr_conformer

Property	Value
Framework	ESPnet v0.10.7a1
Model Type	ASR with Sentiment Analysis
Architecture	Conformer encoder + Transformer decoder
Token Type	Word-based
Model Link	HuggingFace

What is YushiUeda_swbd_sentiment_asr_train_asr_conformer?

This is an ESPnet speech recognition model that performs ASR and sentiment analysis jointly. It uses a Conformer encoder and a Transformer decoder to transcribe speech while also predicting one of three sentiment labels (Positive, Neutral, Negative). The model was trained on the Switchboard corpus with spectral augmentation.
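This page does not say how the sentiment label is returned alongside the transcript. As a minimal sketch, assume (purely for illustration) that the decoded hypothesis is a word sequence whose first word is the sentiment label. The helper `split_sentiment` and the constant `SENTIMENT_LABELS` below are hypothetical names, not part of ESPnet.

```python
from typing import Optional, Tuple

# The three classes named in the description above.
SENTIMENT_LABELS = ("Positive", "Neutral", "Negative")


def split_sentiment(hypothesis: str) -> Tuple[Optional[str], str]:
    """Split a decoded hypothesis into (sentiment, transcript).

    ASSUMPTION: the label, if present, is the first word of the hypothesis.
    Returns (None, transcript) when no known label leads the text.
    """
    words = hypothesis.strip().split()
    if words:
        for label in SENTIMENT_LABELS:
            if words[0].lower() == label.lower():
                return label, " ".join(words[1:])
    return None, " ".join(words)


if __name__ == "__main__":
    print(split_sentiment("Positive i really like it"))
```

If the actual output format differs, only the parsing rule inside `split_sentiment` would need to change.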

Implementation Details

The encoder is a 12-block Conformer with an output dimension of 512, 4 attention heads, and 2048 linear (feed-forward) units. The decoder is a 6-block Transformer with a matching attention configuration. Input features are processed with spectral augmentation and utterance-level mean-variance normalization. Further details:
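Utterance-level mean-variance normalization can be sketched in a few lines: each feature dimension is shifted to zero mean and scaled to unit variance using statistics computed over the frames of that one utterance. The function below is an illustrative pure-Python version, not ESPnet's implementation; the name `utterance_mvn` and the `eps` floor are assumptions.

```python
import math
from typing import List


def utterance_mvn(frames: List[List[float]], eps: float = 1e-20) -> List[List[float]]:
    """Normalize a (num_frames x feature_dim) utterance per feature dimension.

    Statistics come from this utterance only, not from the whole corpus.
    `eps` is a hypothetical floor that avoids division by zero when a
    dimension is constant across all frames.
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
        for f in frames
    ]


if __name__ == "__main__":
    print(utterance_mvn([[1.0, 10.0], [3.0, 10.0]]))
```

Because the statistics are per utterance, the same recording played louder or quieter yields roughly the same normalized features.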

Conformer encoder with CNN modules (kernel size 31)
Relative positional encoding
Macaron-style architecture
Swish activation function
Joint CTC-Attention training with CTC weight of 0.5
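Two items in the list above have compact definitions. Swish is commonly defined as x · sigmoid(x), and joint CTC-attention training interpolates the two losses with the CTC weight (0.5 here). The sketch below writes both out on plain floats; the function names are illustrative, and the loss values passed in are placeholders rather than numbers from this model.

```python
import math


def swish(x: float) -> float:
    """Swish activation, x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


def joint_ctc_attention_loss(
    ctc_loss: float, att_loss: float, ctc_weight: float = 0.5
) -> float:
    """Interpolate the CTC loss and the attention-decoder loss.

    With ctc_weight = 0.5, as listed above, both terms count equally.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    print(swish(1.0))
    print(joint_ctc_attention_loss(2.0, 4.0))
```

Setting `ctc_weight` to 0 or 1 recovers pure attention or pure CTC training, respectively.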

Core Capabilities

Speech recognition with word-level tokenization
Concurrent sentiment classification
Macro F1 of about 61% on sentiment classification
Weighted F1 of about 65% on the validation set
Handles conversational speech input
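The two F1 figures above differ only in how per-class scores are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by how often it occurs in the reference labels. The sketch below computes both from scratch. The function name `f1_scores` and the toy labels are made up for illustration and are not the model's evaluation data.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_scores(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Return macro- and support-weighted F1 over the given labels."""
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted scores 0.0 here.
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    total = sum(support[label] for label in labels)
    weighted = (
        sum(per_class[label] * support[label] for label in labels) / total
        if total
        else 0.0
    )
    return {"macro": macro, "weighted": weighted}


if __name__ == "__main__":
    truth = ["Neutral", "Neutral", "Neutral", "Positive"]
    preds = ["Neutral", "Neutral", "Positive", "Positive"]
    print(f1_scores(truth, preds, ["Positive", "Neutral", "Negative"]))
```

In this toy example the frequent class is predicted well and one class is absent, so the weighted score comes out higher than the macro score. That is one way a weighted F1 can exceed a macro F1 on imbalanced data.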

Frequently Asked Questions

Q: What makes this model unique?

Its distinguishing feature is that it performs ASR and sentiment analysis simultaneously with a single Conformer-Transformer model. Spectral augmentation and relative positional encoding are intended to make it more robust on conversational speech.

Q: What are the recommended use cases?

The model is best suited to applications that need both speech transcription and sentiment analysis, such as call center analytics, customer feedback processing, or conversation analysis. Because it was trained on Switchboard, expect the best results on English conversational speech similar to that corpus.