# emotion-english-distilroberta-base
| Property | Value |
|---|---|
| Author | j-hartmann |
| Downloads | 1,020,570 |
| Likes | 357 |
| Base Architecture | DistilRoBERTa |
| Evaluation Accuracy | 66% |
## What is emotion-english-distilroberta-base?
This is a specialized emotion classification model built on the DistilRoBERTa architecture, designed to identify seven distinct emotions in English text: anger, disgust, fear, joy, neutral, sadness, and surprise. It was trained on a carefully curated, balanced dataset of roughly 20,000 observations drawn from diverse sources, including Twitter, Reddit, student self-reports, and TV dialogues.
## Implementation Details
The model is implemented with the Hugging Face Transformers library and can be deployed with PyTorch. It is built on DistilRoBERTa-base, a distilled, more compact version of RoBERTa that retains robust performance. The balanced training data includes 2,811 observations per emotion category, split 80/20 into training and evaluation sets.
- Simple integration with Hugging Face's pipeline API
- Supports batch processing for multiple examples
- Compatible with various text formats including CSV files
- Provides probability scores for all emotion categories
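The integration points above can be sketched with the pipeline API; the model ID comes from this card, while the sample texts are purely illustrative:

```python
from transformers import pipeline

# Load the model via the Hugging Face pipeline API.
# top_k=None returns probability scores for all seven emotion categories.
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None,
)

# Batch processing: pass a list of texts in a single call.
texts = ["I love this!", "This is terrifying."]
results = classifier(texts)

# One list of {"label", "score"} dicts per input text.
for text, scores in zip(texts, results):
    top = max(scores, key=lambda s: s["score"])
    print(f"{text!r} -> {top['label']} ({top['score']:.3f})")
```

To classify a CSV column, read it with any CSV reader and pass the resulting list of strings to `classifier` in the same way.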
## Core Capabilities
- Multi-class emotion classification across 7 categories
- 66% evaluation accuracy (vs. 14% random baseline)
- Processes both single texts and batch inputs
- Returns confidence scores for each emotion category
- Optimized for English language text analysis
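Because the model returns a confidence score for every category, downstream code typically reduces each result to a top label plus its probability. A minimal sketch, assuming the list-of-dicts score format produced by the pipeline API (the `top_emotion` helper name and the sample scores are my own, not real model output):

```python
# The seven emotion categories listed on this card.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def top_emotion(scores):
    """Return the highest-probability (label, score) pair for one text,
    given a list of {"label": str, "score": float} dicts."""
    best = max(scores, key=lambda s: s["score"])
    return best["label"], best["score"]

# Illustrative scores only (not actual model output):
sample = [
    {"label": "joy", "score": 0.91},
    {"label": "neutral", "score": 0.05},
    {"label": "surprise", "score": 0.04},
]
print(top_emotion(sample))  # ('joy', 0.91)
```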
## Frequently Asked Questions
**Q: What makes this model unique?**
A: The model stands out for its training on six diverse datasets, its balanced representation across emotion categories, and its efficient DistilRoBERTa architecture, which makes it suitable for production environments while keeping accuracy well above the 14% random baseline.
**Q: What are the recommended use cases?**
A: The model is well suited to emotion analysis in social media monitoring, customer feedback analysis, content moderation, and research applications. It is particularly effective on Twitter and Reddit content, which feature prominently in its training data.