XLM-RoBERTa Multilingual Text Genre Classifier

Property	Value
Parameter Count	278M
Languages Supported	94
License	CC-BY-SA-4.0
Architecture	XLM-RoBERTa Base

What is xlm-roberta-base-multilingual-text-genre-classifier?

This is a text classification model that identifies the genre of a text in any of 94 languages. Built on the XLM-RoBERTa Base architecture, it assigns one of 9 genre labels, including Information/Explanation, News, Instruction, and Opinion/Argumentation. In cross-dataset evaluation it reaches a micro F1 score of 0.68, outperforming GPT-4 on genre classification.

Implementation Details

The model was fine-tuned on the X-GENRE dataset for 15 epochs with a learning rate of 1e-5 and a maximum sequence length of 512 tokens. It works best on texts of at least 75 words, and each prediction comes with a confidence score that can be used to filter out unreliable labels.
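As a rough illustration, the hyperparameters above could be written as a simpletransformers-style argument dictionary. This is a sketch rather than the authors' actual training script: only the three values come from the text, and the variable name `train_args` is hypothetical.

```python
# Fine-tuning settings named in the text, using simpletransformers
# argument names. Any setting not mentioned in the text is left out
# rather than guessed.
train_args = {
    "num_train_epochs": 15,   # 15 training epochs
    "learning_rate": 1e-5,    # learning rate of 1e-5
    "max_seq_length": 512,    # inputs truncated to 512 tokens
}

# With simpletransformers installed, a dictionary like this would
# typically be passed when constructing the model (assumed usage):
#   from simpletransformers.classification import ClassificationModel
#   model = ClassificationModel("xlmroberta", "xlm-roberta-base",
#                               num_labels=9, args=train_args)
```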

Supports 9 distinct genre categories with detailed classification criteria
Supports confidence threshold filtering (recommended threshold: above 0.9)
Achieves an F1 score of 0.92 on production data after post-processing
Built with the simpletransformers framework
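The two filters mentioned above (minimum text length and confidence threshold) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the model's own post-processing: it assumes the confidence score is the softmax probability of the top class, that the classifier returns one raw score per genre, and it uses a shortened, hypothetically ordered label list with only the four genres named in this document. The names `softmax`, `filter_prediction`, and `EXAMPLE_LABELS` are illustrative.

```python
import math

MIN_WORDS = 75         # from the text: works best on texts of 75+ words
MIN_CONFIDENCE = 0.9   # from the text: recommended confidence threshold

# Illustrative subset only; the real model has 9 labels and its own order.
EXAMPLE_LABELS = ["Information/Explanation", "News",
                  "Instruction", "Opinion/Argumentation"]

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_prediction(text, logits, labels):
    """Return (label, confidence), or None if the text is too short
    or the top probability does not exceed the threshold."""
    if len(text.split()) < MIN_WORDS:
        return None
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] <= MIN_CONFIDENCE:
        return None
    return labels[best], probs[best]
```

In a real pipeline the `logits` argument would come from the classifier's raw outputs for each text. Predictions that return `None` can be left unlabeled or set aside for manual review.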

Core Capabilities

Multilingual genre classification across 94 languages
Robust performance in cross-dataset scenarios
Automatic confidence scoring for prediction reliability
Suitable for large-scale text collection annotation

Frequently Asked Questions

Q: What makes this model unique?

The model combines genre classification across 94 languages with high accuracy. It outperforms other technologies, including GPT-4, in cross-dataset evaluations, which makes it particularly valuable for analyzing large text collections.

Q: What are the recommended use cases?

The model is well suited to automatic genre identification in large text collections, library cataloging, content organization, and research. It is most effective on documents of at least 75 words and when combined with confidence threshold filtering.
