bert-base-arabertv2

Maintained By
aubmindlab

Property | Value
Model Size | 543MB
Parameters | 136M
Training Data | 200M sentences (77GB)
Author | aubmindlab
Pre-segmentation | Yes (Farasa)
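
As a rough sanity check on the figures above, the listed model size is close to what 136M parameters would occupy if each were stored as a 32-bit float. The 4-bytes-per-parameter figure is an assumption (the card does not state the storage precision), and the helper name is purely illustrative.

```python
# Rough size estimate. Assumes float32 weights (4 bytes each),
# which this card does not actually state.
BYTES_PER_PARAM = 4  # assumption: float32


def estimated_size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the approximate checkpoint size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


size = estimated_size_mb(136_000_000)
print(f"{size:.0f} MB")  # 544 MB, close to the 543MB listed
```

The small gap between the estimate and the listed size is expected, since 136M is itself a rounded parameter count.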

What is bert-base-arabertv2?

bert-base-arabertv2 is an Arabic language model based on the BERT architecture and an updated version of earlier AraBERT releases. It was trained on 8.6B words of text that was pre-segmented with the Farasa Segmenter, which splits Arabic words into prefixes, stems, and suffixes before tokenization.

Implementation Details

The model uses improved preprocessing and a refined wordpiece vocabulary that handles punctuation and numbers better. It was trained on TPUv3-8 hardware for 3M steps, using sequence lengths of 128 and 512 tokens.

  • Implements advanced text segmentation with Farasa
  • Uses enhanced preprocessing for better handling of Arabic text
  • Trained on diverse sources including OSCAR corpus, Wikipedia, and news articles
  • Supports the fast tokenizer implementation from the transformers library
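
A minimal sketch of loading the model through the transformers library might look like the following. It assumes the Hub identifier `aubmindlab/bert-base-arabertv2`, that `transformers` and PyTorch are installed, and that the input text has already been segmented with Farasa; the helper names `load_arabert` and `encode` are hypothetical.

```python
MODEL_ID = "aubmindlab/bert-base-arabertv2"  # assumed Hugging Face Hub identifier


def load_arabert(model_id: str = MODEL_ID):
    """Load the tokenizer and encoder; weights are downloaded on first call."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def encode(tokenizer, model, segmented_text: str):
    """Return the last hidden states for text that is already Farasa-segmented."""
    inputs = tokenizer(
        segmented_text,
        return_tensors="pt",  # PyTorch tensors
        truncation=True,
        max_length=512,  # the longer of the two training sequence lengths
    )
    return model(**inputs).last_hidden_state
```

Calling `tokenizer, model = load_arabert()` and then `encode(tokenizer, model, text)` should return one contextual vector per wordpiece. Because the model was trained on pre-segmented text, raw Arabic would normally be passed through Farasa segmentation first.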

Core Capabilities

  • Sentiment Analysis across multiple datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR)
  • Named Entity Recognition with ANERcorp
  • Arabic Question Answering on Arabic-SQuAD and ARCD
  • Advanced text preprocessing and segmentation

Frequently Asked Questions

Q: What makes this model unique?

Compared with previous versions, the model was trained on 3.5x more data, uses an improved preprocessing pipeline, and relies on Farasa segmentation of Arabic text. Its refined vocabulary also handles numbers and punctuation better.

Q: What are the recommended use cases?

The model is suited to Arabic NLP tasks such as sentiment analysis, named entity recognition, and question answering. It is a particularly good fit for applications that depend on correct handling of Arabic prefixes and suffixes.
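
To make the prefix and suffix handling concrete, here is a toy sketch that reverses a segmentation. It assumes the segmented text marks prefixes with a trailing "+" (for example "ال+") and suffixes with a leading "+" (for example "+نا"); that marker convention and the function name `desegment` are assumptions for illustration, and the code is not Farasa itself.

```python
def desegment(segmented: str) -> str:
    """Rejoin '+'-marked prefixes and suffixes into whole words (toy illustration)."""
    words = []
    glue_next = False  # True when the previous token was a prefix such as "ال+"
    for token in segmented.split():
        is_suffix = token.startswith("+") and len(token) > 1
        is_prefix = token.endswith("+") and len(token) > 1
        piece = token.strip("+") if (is_prefix or is_suffix) else token
        if (is_suffix or glue_next) and words:
            words[-1] += piece  # attach to the word being built
        else:
            words.append(piece)  # start a new word
        glue_next = is_prefix
    return " ".join(words)


print(desegment("و+ ال+ كتاب"))  # والكتاب
print(desegment("قل +نا"))  # قلنا
```

A step like this is useful after tasks such as named entity recognition or question answering, where predictions are made on segmented text but need to be shown to users as ordinary words.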
