XLM-RoBERTa Multilingual Text Genre Classifier
| Property | Value |
|---|---|
| Parameter Count | 278M |
| Languages Supported | 94 |
| License | CC-BY-SA-4.0 |
| Architecture | XLM-RoBERTa Base |
What is xlm-roberta-base-multilingual-text-genre-classifier?
This is a text classification model that identifies the genre of a text in any of 94 languages. Built on the XLM-RoBERTa architecture, it assigns each text to one of 9 genres, including Information/Explanation, News, Instruction, and Opinion/Argumentation. In cross-dataset evaluation it reaches a micro F1 score of 0.68 and is reported to outperform GPT-4 on the same genre classification task.
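For readers unfamiliar with the metric, micro F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The following is a minimal sketch of that calculation, not the evaluation code used for this model; the function name `micro_f1` and the toy labels are illustrative assumptions.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            tp += 1          # correct label: one true positive
        else:
            fp += 1          # the predicted label gains a false positive
            fn += 1          # the gold label gains a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data using genre names mentioned above (illustrative only).
gold = ["News", "News", "Instruction", "Opinion/Argumentation"]
predicted = ["News", "Instruction", "Instruction", "News"]
score = micro_f1(gold, predicted)
```

When every text receives exactly one label, as here, micro F1 equals plain accuracy, because each error counts once as a false positive and once as a false negative.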
Implementation Details
The model was fine-tuned on the X-GENRE dataset for 15 epochs with a learning rate of 1e-5 and a maximum sequence length of 512 tokens. It works best on texts of at least 75 words and reports a confidence score with each prediction, which can be used to discard unreliable labels.
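As a rough illustration, the hyperparameters above could be written as a training configuration like the one below. The key names follow the simpletransformers model-args naming; how the authors actually set up training is an assumption, since the text gives only the values.

```python
# Hyperparameters stated in the text, collected in a plain dict.
# Key names follow simpletransformers' model args (assumed, not confirmed
# by the source, which lists only the values).
train_args = {
    "num_train_epochs": 15,   # training epochs
    "learning_rate": 1e-5,    # optimizer learning rate
    "max_seq_length": 512,    # longer inputs are truncated to 512 tokens
}

# Hypothetical usage, not executed here (needs simpletransformers installed
# and downloads the base model):
# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel(
#     "xlmroberta", "xlm-roberta-base", num_labels=9, args=train_args
# )
```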
- Supports 9 genre categories, each with detailed classification criteria
- Predictions are best filtered by confidence (recommended threshold: > 0.9)
- Reaches an F1 score of 0.92 on production data after post-processing
- Built with the simpletransformers framework
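The length and confidence recommendations above can be sketched as a small post-processing filter. This is a minimal sketch that assumes the classifier returns one raw score (logit) per genre for each text; the function names are hypothetical, and the label list contains only the four genres named in this document rather than the full set of 9.

```python
import math

MIN_WORDS = 75          # minimum text length recommended above
MIN_CONFIDENCE = 0.9    # recommended confidence threshold


def softmax(logits):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_prediction(text, logits, labels):
    """Return (label, confidence) if the prediction passes both filters, else None."""
    if len(text.split()) < MIN_WORDS:
        return None                      # text too short to classify reliably
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] <= MIN_CONFIDENCE:
        return None                      # prediction not confident enough
    return labels[best], probs[best]


# Illustrative subset of the genre labels (the model has 9 in total).
labels = ["Information/Explanation", "News", "Instruction", "Opinion/Argumentation"]
long_text = "word " * 80
confident = keep_prediction(long_text, [0.0, 6.0, 0.0, 0.0], labels)
uncertain = keep_prediction(long_text, [1.0, 1.2, 1.0, 1.0], labels)
too_short = keep_prediction("only a few words", [0.0, 6.0, 0.0, 0.0], labels)
```

Whitespace splitting is a crude word count and will undercount in languages written without spaces, so the length check may need a language-aware tokenizer in a multilingual setting.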
Core Capabilities
- Multilingual genre classification across 94 languages
- Micro F1 of 0.68 in cross-dataset evaluation
- Automatic confidence scoring for prediction reliability
- Suitable for large-scale text collection annotation
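For annotating a large collection, one possible workflow is sketched below. The `classify` argument is a stand-in for the real model: any callable that maps a list of strings to a list of `{"label", "score"}` dicts, which is the shape a Hugging Face `transformers` text-classification pipeline returns. Wiring in the actual model is left out and is an assumption; `annotate_collection` is a hypothetical helper name.

```python
from collections import Counter

MIN_WORDS = 75          # shorter texts are skipped
MIN_CONFIDENCE = 0.9    # predictions at or below this score are skipped


def annotate_collection(texts, classify):
    """Label each text, skipping short texts and low-confidence predictions.

    `classify` maps a list of strings to a list of dicts with
    "label" and "score" keys, one dict per input text.
    """
    long_enough = [t for t in texts if len(t.split()) >= MIN_WORDS]
    results = classify(long_enough) if long_enough else []
    annotated = [
        (text, result["label"])
        for text, result in zip(long_enough, results)
        if result["score"] > MIN_CONFIDENCE
    ]
    genre_counts = Counter(label for _, label in annotated)
    skipped = len(texts) - len(annotated)
    return annotated, genre_counts, skipped
```

Returning the number of skipped texts alongside the labels makes it easy to see how much of a collection the length and confidence filters discard.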
Frequently Asked Questions
Q: What makes this model unique?
A single model classifies genre across 94 languages. In cross-dataset evaluation it is reported to outperform other approaches, including GPT-4, which makes it particularly useful for analysing large text collections.
Q: What are the recommended use cases?
The model suits automatic genre identification in large text collections, library cataloging, content organization, and research. It works best on documents of at least 75 words, with predictions filtered by a confidence threshold.