bert-base-multilingual-cased-finetuned-yoruba
Property | Value |
---|---|
Author | Davlan |
Base Model | bert-base-multilingual-cased |
Language | Yoruba |
Training Hardware | NVIDIA V100 GPU |
What is bert-base-multilingual-cased-finetuned-yoruba?
This is a BERT model fine-tuned specifically for the Yoruba language, built on top of the bert-base-multilingual-cased checkpoint. It delivers stronger performance on Yoruba text analysis tasks, such as named entity recognition and text classification, than the standard multilingual BERT (mBERT).
Implementation Details
The model was trained on a diverse dataset including Bible texts, JW300, Menyo-20k, Yoruba Embedding corpus, CC-Aligned, Wikipedia, and various news sources including BBC Yoruba, VON Yoruba, Asejere, and Alaroye. Training was conducted on a single NVIDIA V100 GPU, focusing on optimizing performance for Yoruba language understanding.
- Achieves an 82.58% F1 score on MasakhaNER named entity recognition (vs. 78.97% for mBERT)
- Achieves a 79.11% F1 score on BBC Yorùbá text classification (vs. 75.13% for mBERT)
- Supports masked token prediction through the Transformers fill-mask pipeline (see the sketch below)
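As a minimal sketch, masked token prediction can be run through the Hugging Face fill-mask pipeline. This assumes the checkpoint is published on the Hub as Davlan/bert-base-multilingual-cased-finetuned-yoruba; the input sentence is purely illustrative.

```python
from transformers import pipeline

# Fill-mask pipeline backed by the fine-tuned Yoruba checkpoint
# (Hub id assumed: Davlan/bert-base-multilingual-cased-finetuned-yoruba).
unmasker = pipeline(
    "fill-mask",
    model="Davlan/bert-base-multilingual-cased-finetuned-yoruba",
)

# Predict the masked token in an illustrative Yoruba sentence.
predictions = unmasker(
    "Arẹmọ Phillip to jẹ ọkọ [MASK] Elizabeth to ti wa lori aisan ti dagbere faye lẹni ọdun mọkandilọgọrun."
)

# Each prediction carries the filled-in token and its probability score.
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```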
Core Capabilities
- Named Entity Recognition in Yoruba text (after task-specific fine-tuning; see the sketch following this list)
- Text Classification tasks
- Masked Language Modeling
- Context-aware token prediction
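Because the released checkpoint is a masked language model, tasks such as named entity recognition require attaching a task head and fine-tuning on labelled Yoruba data. The sketch below shows one way to initialize a token-classification model from this checkpoint; the Hub id and the MasakhaNER-style label set are assumptions for illustration, not part of the original card.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "Davlan/bert-base-multilingual-cased-finetuned-yoruba"  # assumed Hub id

# Hypothetical MasakhaNER-style label set (PER, ORG, LOC, DATE) for illustration.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The token-classification head is randomly initialized and must be trained
# on labelled Yoruba data (e.g. MasakhaNER) before the model is usable for NER.
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# A forward pass on a short Yoruba phrase; logits have shape
# (batch_size, sequence_length, num_labels) and feed a standard fine-tuning loop.
inputs = tokenizer("Ìròyìn BBC Yorùbá", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```

A standard fine-tuning loop (for example with the Transformers Trainer) over MasakhaNER annotations is what produces NER scores like the F1 quoted above.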
Frequently Asked Questions
Q: What makes this model unique?
This model is fine-tuned specifically for Yoruba and outperforms the general-purpose multilingual BERT on both named entity recognition and text classification benchmarks. It is trained on a broad collection of Yoruba texts spanning religious, news, and web sources, which makes it well suited to real-world Yoruba content.
Q: What are the recommended use cases?
The model is well suited to named entity recognition, text classification, and other Yoruba language understanding tasks. It is particularly effective on the kinds of text it was trained on: news content, religious texts, and general Yoruba documents.