bert-base-multilingual-cased-finetuned-yoruba
Property | Value |
---|---|
Author | Davlan |
Base Model | bert-base-multilingual-cased |
Language | Yoruba |
Training Hardware | NVIDIA V100 GPU |
What is bert-base-multilingual-cased-finetuned-yoruba?
This is a BERT model fine-tuned specifically for the Yoruba language, built on top of the bert-base-multilingual-cased checkpoint. It delivers stronger performance on Yoruba text analysis tasks, such as named entity recognition and text classification, than the standard multilingual BERT (mBERT).
Implementation Details
The model was trained on a diverse dataset including Bible texts, JW300, Menyo-20k, Yoruba Embedding corpus, CC-Aligned, Wikipedia, and various news sources including BBC Yoruba, VON Yoruba, Asejere, and Alaroye. Training was conducted on a single NVIDIA V100 GPU, focusing on optimizing performance for Yoruba language understanding.
- Achieves an 82.58% F1 score on MasakhaNER named entity recognition (vs. 78.97% for mBERT)
- Achieves a 79.11% F1 score on BBC Yorùbá text classification (vs. 75.13% for mBERT)
- Supports masked token prediction through the Transformers fill-mask pipeline (see the sketch below)
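As a minimal sketch, masked token prediction can be run through the Hugging Face fill-mask pipeline. This assumes the checkpoint is published on the Hub as Davlan/bert-base-multilingual-cased-finetuned-yoruba; the input sentence is purely illustrative.

```python
from transformers import pipeline

# Fill-mask pipeline backed by the fine-tuned Yoruba checkpoint
# (Hub id assumed: Davlan/bert-base-multilingual-cased-finetuned-yoruba).
unmasker = pipeline(
    "fill-mask",
    model="Davlan/bert-base-multilingual-cased-finetuned-yoruba",
)

# Predict the masked token in an illustrative Yoruba sentence.
predictions = unmasker(
    "Arẹmọ Phillip to jẹ ọkọ [MASK] Elizabeth to ti wa lori aisan ti dagbere faye lẹni ọdun mọkandilọgọrun."
)

# Each prediction carries the filled-in token and its probability score.
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```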
Core Capabilities
- Named Entity Recognition in Yoruba text (after task-specific fine-tuning; see the sketch following this list)
- Text Classification tasks
- Masked Language Modeling
- Context-aware token prediction
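Because the released checkpoint is a masked language model, tasks such as named entity recognition require attaching a task head and fine-tuning on labelled Yoruba data. The sketch below shows one way to initialize a token-classification model from this checkpoint; the Hub id and the MasakhaNER-style label set are assumptions for illustration, not part of the original card.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "Davlan/bert-base-multilingual-cased-finetuned-yoruba"  # assumed Hub id

# Hypothetical MasakhaNER-style label set (PER, ORG, LOC, DATE) for illustration.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The token-classification head is randomly initialized and must be trained
# on labelled Yoruba data (e.g. MasakhaNER) before the model is usable for NER.
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# A forward pass on a short Yoruba phrase; logits have shape
# (batch_size, sequence_length, num_labels) and feed a standard fine-tuning loop.
inputs = tokenizer("Ìròyìn BBC Yorùbá", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```

A standard fine-tuning loop (for example with the Transformers Trainer) over MasakhaNER annotations is what produces NER scores like the F1 quoted above.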
Frequently Asked Questions
Q: What makes this model unique?
This model is fine-tuned specifically for Yoruba and outperforms the general-purpose multilingual BERT on both named entity recognition and text classification benchmarks. It is trained on a broad collection of Yoruba texts spanning religious, news, and web sources, which makes it well suited to real-world Yoruba content.
Q: What are the recommended use cases?
The model is well suited to named entity recognition, text classification, and other Yoruba language understanding tasks. It is particularly effective on the kinds of text it was trained on: news content, religious texts, and general Yoruba documents.