arabic-t5-small
| Property | Value |
|---|---|
| Author | flax-community |
| Training Time | 22h 23m 51s |
| Evaluation Accuracy | 56.84% |
| Vocabulary Size | 64,000 |
| Model URL | [HuggingFace](https://huggingface.co/flax-community/arabic-t5-small) |
What is arabic-t5-small?
arabic-t5-small is a T5 v1.1 small model trained specifically for Arabic language processing. It was pre-trained on a dataset combining the Arabic Billion Words corpus with the Arabic subsets of the mC4 and OSCAR datasets. Due to time constraints, training covered only about 10% of the full dataset, which corresponds to 22,000 steps, or roughly 4.3 billion tokens.
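For orientation, the sketch below loads the checkpoint with the transformers library. It assumes the model id flax-community/arabic-t5-small (derived from the author and model name above) and that PyTorch weights are published in the repo; the masked sample sentence is purely illustrative.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed model id, derived from the author and model name above.
MODEL_ID = "flax-community/arabic-t5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# If the repo ships only Flax weights, add from_flax=True (requires flax).
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)

# The raw checkpoint was pre-trained with span corruption, so it fills in
# sentinel tokens rather than answering free-form prompts. This assumes the
# tokenizer exposes T5-style sentinel tokens such as <extra_id_0>.
text = "الطقس اليوم <extra_id_0> جدا"  # illustrative masked sentence
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```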
Implementation Details
Unlike most other Arabic language models, this model preserves diacritics in its vocabulary. Training used a batch size of 384, a learning rate of 1e-2, and the jnp.float32 dtype. Preprocessing was intentionally minimal: only URLs, emails, and social media mentions were replaced with fixed tokens (a sketch of this step follows the list below).
- Training batch size: 384
- Evaluation batch size: 768
- Learning rate: 1e-2
- Tokenizer trained on 5% of the training set
- Vocabulary size: 64,000 tokens
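The model card does not state the exact replacement tokens or patterns, so the regexes and placeholder strings below are assumptions; this is only a minimal sketch of the kind of preprocessing described.

```python
import re

# Hypothetical placeholder tokens: the card says only that URLs, emails,
# and social media mentions become fixed tokens, not which strings were used.
URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[URL]", "[EMAIL]", "[USER]"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
USER_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    """Replace URLs, emails, and social media mentions with fixed tokens."""
    text = URL_RE.sub(URL_TOKEN, text)
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)  # run before the mention regex,
    text = USER_RE.sub(USER_TOKEN, text)    # which would also match "@domain"
    return text

print(preprocess("تابعوني على @user أو راسلوني على me@example.com"))
# تابعوني على [USER] أو راسلوني على [EMAIL]
```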
Core Capabilities
- Arabic text generation and processing
- Preserves Arabic diacritics, which carry phonetic and grammatical information
- Suitable for fine-tuning on specific tasks
- Achieves 56.84% evaluation accuracy
Frequently Asked Questions
Q: What makes this model unique?
This model stands out by keeping Arabic diacritics in its vocabulary, unlike most other Arabic language models. It also uses a minimal preprocessing approach that preserves the natural structure of Arabic text, normalizing only technical elements such as URLs, emails, and social media mentions. One quick way to check the diacritics claim is shown in the sketch below.
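The snippet below runs a tokenizer round trip on an arbitrary diacritized sentence (the sample text is illustrative, not from the model card) and checks whether the diacritics survive.

```python
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")

# Arbitrary fully diacritized sample sentence (illustrative).
text = "السَّلَامُ عَلَيْكُمْ"

ids = tokenizer.encode(text)
roundtrip = tokenizer.decode(ids, skip_special_tokens=True)

# Arabic diacritics are Unicode combining marks (category "Mn"); if the
# vocabulary preserves them, they survive the encode/decode round trip.
print(roundtrip)
print(any(unicodedata.category(ch) == "Mn" for ch in roundtrip))
```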
Q: What are the recommended use cases?
The model is particularly suitable for Arabic text processing tasks that require diacritic sensitivity. When fine-tuning, enable dropout (a rate of 0.1 is suggested), since pre-training was done with dropout turned off; see the sketch below.
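Concretely, one way to re-enable dropout is to override the config's dropout_rate when loading; a minimal sketch, assuming PyTorch weights are available:

```python
from transformers import T5ForConditionalGeneration

# Pre-training ran with dropout off; re-enable it for fine-tuning by
# overriding the config value at load time (0.1 as suggested above).
model = T5ForConditionalGeneration.from_pretrained(
    "flax-community/arabic-t5-small",
    dropout_rate=0.1,
)
model.train()  # dropout is active only in training mode
```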