arabic-t5-small

Maintained By
flax-community

Property              Value
Author                flax-community
Training Time         22h 23m 51s
Evaluation Accuracy   56.84%
Vocabulary Size       64,000
Model URL             HuggingFace

What is arabic-t5-small?

arabic-t5-small is a T5 v1.1 small model trained for Arabic language processing. The training data combined the Arabic Billion Words corpus with the Arabic subsets of the mC4 and OSCAR datasets. Due to time constraints, training covered only about 10% of the full dataset, equivalent to 22,000 steps or 4.3 billion tokens.
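As a rough consistency check, the reported token count can be reproduced from the step count and the training batch size. The sketch below assumes a sequence length of 512 tokens per example, which this description does not state; the other two numbers come from the model's reported figures.

```python
# Back-of-the-envelope check of the reported training volume.
STEPS = 22_000      # training steps reported for this model
BATCH_SIZE = 384    # training batch size reported for this model
SEQ_LEN = 512       # ASSUMPTION: tokens per example, not stated in this description

tokens_seen = STEPS * BATCH_SIZE * SEQ_LEN
print(f"{tokens_seen / 1e9:.1f} billion tokens")  # prints "4.3 billion tokens"

# If this is roughly 10% of the data, the full dataset is about ten times larger.
approx_full_dataset_tokens = tokens_seen * 10
```

Under that assumption the product matches the 4.3 billion tokens quoted above, which suggests the figures are mutually consistent.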

Implementation Details

Unlike most other Arabic language models, this model preserves diacritics in its vocabulary. Training used a batch size of 384, a learning rate of 1e-2, and the jnp.float32 dtype. Preprocessing was intentionally minimal: only URLs, emails, and social media mentions were replaced with fixed tokens.

  • Training batch size: 384
  • Evaluation batch size: 768
  • Learning rate: 1e-2
  • Tokenizer trained on 5% of training set
  • Vocabulary size: 64,000 tokens
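
The minimal preprocessing described in this section could look roughly like the sketch below. It is not the authors' code: the placeholder strings, the regular expressions, and the function name `preprocess` are all assumptions made for illustration, since the description only says that URLs, emails, and mentions are replaced with fixed tokens.

```python
import re

# ASSUMPTION: these placeholder strings are hypothetical; the actual tokens are not given.
URL_TOKEN = "[URL]"
EMAIL_TOKEN = "[EMAIL]"
MENTION_TOKEN = "[MENTION]"

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def preprocess(text: str) -> str:
    """Replace URLs, emails, and @mentions; leave all other text, including diacritics, untouched."""
    text = URL_RE.sub(URL_TOKEN, text)
    # Emails are handled before mentions so the "@domain" part is not mistaken for a mention.
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)
    text = MENTION_RE.sub(MENTION_TOKEN, text)
    return text


sample = "contact a.b@example.org or @user, see https://example.com/page"
cleaned = preprocess(sample)
```

Because nothing else is normalized, diacritized Arabic words pass through this function unchanged.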

Core Capabilities

  • Arabic text generation and processing
  • Preserves Arabic diacritics for enhanced linguistic accuracy
  • Suitable for fine-tuning on specific tasks
  • Achieves 56.84% evaluation accuracy

Frequently Asked Questions

Q: What makes this model unique?

This model keeps Arabic diacritics in its vocabulary, unlike most other Arabic language models. Its preprocessing is also minimal: the natural structure of Arabic text is left intact, and only technical elements such as URLs, emails, and social media mentions are replaced.
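
To make the diacritics point concrete, the sketch below shows the kind of diacritic-stripping normalization that this model's pipeline, as described, does not apply. The function name `strip_diacritics` is hypothetical, and the character range used (U+064B to U+0652) covers only the common tashkeel marks.

```python
import re

# Common Arabic diacritics (tashkeel): FATHATAN (U+064B) through SUKUN (U+0652).
TASHKEEL_RE = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Hypothetical normalization step that a diacritic-free pipeline might apply."""
    return TASHKEEL_RE.sub("", text)


# KAF, TEH, BEH, each followed by a FATHA mark.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)

print(len(vocalized), len(bare))  # prints "6 3"
```

A model whose vocabulary keeps diacritics sees the six-character vocalized form, whereas a pipeline with a step like this would only ever see the three bare letters.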

Q: What are the recommended use cases?

The model is particularly suitable for Arabic text processing tasks that require diacritic sensitivity. Because pre-training was done with dropout turned off, enable dropout when fine-tuning; the recommended rate is 0.1.
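
One way to enable dropout for fine-tuning is sketched below using the Hugging Face transformers library. Two things are assumptions here: the repository id is inferred from the author and model name listed above, and the checkpoint is assumed to ship weights that `T5ForConditionalGeneration` can load directly. The helper name `load_for_finetuning` is illustrative.

```python
# ASSUMPTION: the Hub repository id is inferred from the author and model name.
MODEL_ID = "flax-community/arabic-t5-small"
FINETUNE_DROPOUT = 0.1  # recommended rate; pre-training ran with dropout turned off


def load_for_finetuning(model_id: str = MODEL_ID, dropout_rate: float = FINETUNE_DROPOUT):
    """Load the tokenizer and model with dropout enabled in the model config."""
    # Imported lazily so the constants above are usable without transformers installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Keyword arguments that match config attributes override the stored config;
    # `dropout_rate` is the dropout field of T5Config.
    model = T5ForConditionalGeneration.from_pretrained(model_id, dropout_rate=dropout_rate)
    model.train()  # dropout is only active in training mode
    return tokenizer, model
```

Calling `load_for_finetuning()` downloads the checkpoint, so it needs network access. If only Flax weights turn out to be published for this checkpoint, passing `from_flax=True` to `from_pretrained` may be required.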
