xlm-roberta-base-finetuned-ner-yoruba

mbeukman

NER model fine-tuned on Yoruba language data, achieving 78.22% F1 score. Based on XLM-RoBERTa, specialized for African language NER tasks.

  • Author: mbeukman
  • License: Apache License 2.0
  • Task: Named Entity Recognition (NER)
  • Base Model: xlm-roberta-base
  • F1 Score: 78.22%

What is xlm-roberta-base-finetuned-ner-yoruba?

This is a specialized Named Entity Recognition (NER) model fine-tuned on the Yoruba subset of the MasakhaNER dataset. Built on the XLM-RoBERTa base model, it identifies and classifies named entities in Yoruba text: persons, organizations, locations, and dates. The model was trained on an NVIDIA RTX 3090 GPU and reaches an overall F1 score of 78.22%, with per-category results reported below.
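The model can be loaded with the standard Hugging Face token-classification pipeline. A minimal sketch follows; the repository ID `mbeukman/xlm-roberta-base-finetuned-ner-yoruba` is assumed from the author and model names above, and the sample sentence is only illustrative.

```python
# Assumed Hugging Face repo ID (author + model name from this card).
MODEL_ID = "mbeukman/xlm-roberta-base-finetuned-ner-yoruba"

def load_ner_pipeline():
    """Build a token-classification pipeline for the Yoruba NER model.

    transformers is imported lazily so this sketch only requires the
    library (and a network connection) when actually called.
    """
    from transformers import (AutoModelForTokenClassification,
                              AutoTokenizer, pipeline)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
    # aggregation_strategy="simple" merges B-/I- subword tags into entity spans
    return pipeline("ner", model=model, tokenizer=tokenizer,
                    aggregation_strategy="simple")

if __name__ == "__main__":
    ner = load_ner_pipeline()
    print(ner("Adé lọ sí Èkó"))  # illustrative Yoruba sentence
```

Each returned entry contains the entity group (PER, ORG, LOC, or DATE), the matched text span, and a confidence score.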

Implementation Details

The model was fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. Training was repeated across 5 random seeds to check robustness, and this release is the best-performing run.

  • Training Time: 10-30 minutes per iteration
  • GPU Memory Required: 14GB (optimal), 6.5GB (minimum with batch size 1)
  • Architecture: XLM-RoBERTa with token classification head
  • Dataset: MasakhaNER Yoruba subset
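The training configuration above can be sketched with the transformers `TrainingArguments` API. This is a hypothetical reconstruction from the hyperparameters reported in this card, not the author's actual training script; dataset loading and tokenization are omitted.

```python
# Hyperparameters as reported in this model card.
HYPERPARAMS = {
    "max_seq_length": 200,
    "per_device_train_batch_size": 32,
    "learning_rate": 5e-5,
    "num_train_epochs": 50,
}

def make_training_args(output_dir="xlmr-ner-yoruba", seed=0):
    """Build TrainingArguments matching the card's reported setup.

    The card reports 5 runs with different random seeds; pass a
    different `seed` per run. transformers is imported lazily so the
    sketch itself has no import-time dependency.
    """
    from transformers import TrainingArguments
    return TrainingArguments(
        output_dir=output_dir,
        seed=seed,
        per_device_train_batch_size=HYPERPARAMS["per_device_train_batch_size"],
        learning_rate=HYPERPARAMS["learning_rate"],
        num_train_epochs=HYPERPARAMS["num_train_epochs"],
    )
```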

Core Capabilities

  • Entity Detection: Persons (F1: 82%), Locations (F1: 80%), Organizations (F1: 71%), Dates (F1: 77%)
  • Token Classification: 9 labels in the BIO scheme — O plus B-/I- tags for the PER, ORG, LOC, and DATE categories
  • Multilingual Foundation: Built on XLM-RoBERTa's multilingual capabilities
  • African Language Support: Specialized for Yoruba text processing
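The 9-label BIO scheme works as follows: a B- tag opens an entity, consecutive I- tags of the same type extend it, and O marks non-entity tokens. A small stdlib-only sketch of decoding these tags into entity spans (the helper name `bio_to_spans` is illustrative, not part of the model's API):

```python
# The 9 MasakhaNER tags: O plus B-/I- for DATE, LOC, ORG, PER.
LABELS = ["O", "B-DATE", "I-DATE", "B-LOC", "I-LOC",
          "B-ORG", "I-ORG", "B-PER", "I-PER"]

def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new entity, closing any open one.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # I- extends an open entity of the same type.
            current[1].append(token)
        else:
            # O (or a stray I-) closes any open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]
```

For example, `bio_to_spans(["Adé", "lọ", "sí", "Èkó"], ["B-PER", "O", "O", "B-LOC"])` yields `[("PER", "Adé"), ("LOC", "Èkó")]`.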

Frequently Asked Questions

Q: What makes this model unique?

This model is part of the first large-scale effort to create high-quality NER models for African languages, specifically optimized for Yoruba. It demonstrates strong performance while addressing the critical need for NLP tools in under-resourced languages.

Q: What are the recommended use cases?

The model is primarily intended for NLP research purposes, including interpretability studies and transfer learning experiments. It's not recommended for production use due to potential limitations in generalizability and performance across different domains.
