AraModernBert-Base-V1.0

Property	Value
Parameters	~149M
Context Length	8,192 tokens
Architecture	ModernBERT
Vocabulary Size	50,280 tokens
Model Type	Transformer (ModernBert)
Developer	NAMAA-Space
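
As a rough sanity check on the ~149M figure, the largest weight matrices can be counted by hand. The sketch below uses the vocabulary size from the table and the layer count and hidden size given under Implementation Details. The feed-forward width of 1,152, the gated feed-forward layout, and the absence of bias terms are assumptions taken from the usual ModernBERT base configuration, not from this page. Normalization weights and the prediction head are ignored.

```python
# Back-of-the-envelope parameter count (a sketch, not an official breakdown).
VOCAB_SIZE = 50_280        # from the property table
HIDDEN_SIZE = 768          # from Implementation Details
NUM_LAYERS = 22            # from Implementation Details
INTERMEDIATE_SIZE = 1_152  # ASSUMPTION: typical ModernBERT base value

# Token embedding matrix: one row of HIDDEN_SIZE weights per vocabulary entry.
embedding = VOCAB_SIZE * HIDDEN_SIZE

# Attention: fused query/key/value projection plus the output projection.
attn_qkv = HIDDEN_SIZE * 3 * HIDDEN_SIZE
attn_out = HIDDEN_SIZE * HIDDEN_SIZE

# Gated feed-forward block (ASSUMPTION): the input projection produces both
# a value half and a gate half, so it is twice INTERMEDIATE_SIZE wide.
mlp_in = HIDDEN_SIZE * 2 * INTERMEDIATE_SIZE
mlp_out = INTERMEDIATE_SIZE * HIDDEN_SIZE

per_layer = attn_qkv + attn_out + mlp_in + mlp_out
total = embedding + NUM_LAYERS * per_layer

print(f"embeddings: {embedding / 1e6:.1f}M")
print(f"per layer:  {per_layer / 1e6:.1f}M")
print(f"total:      {total / 1e6:.1f}M")
```

Under these assumptions the embedding matrix alone accounts for roughly a quarter of the total, which is why the choice of vocabulary size matters for a model of this scale.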

What is AraModernBert-Base-V1.0?

AraModernBert-Base-V1.0 is an Arabic language model built on the ModernBERT architecture. It was trained on 100 GB of Arabic text, uses a custom tokenizer with a 50,280-token vocabulary, and initializes its embedding layer with the Trans-tokenization technique.

Implementation Details

The model has 22 transformer layers with a hidden size of 768. Attention alternates between two modes: every third layer uses global attention over the full sequence, and the remaining layers use local attention with a 128-token window. Rotary Positional Embeddings (RoPE) use a different theta value for each mode: 160000.0 for global attention and 10000.0 for local attention.

22 transformer layers with a hidden size of 768
12 attention heads
8,192 token context window
Alternating attention mechanism
Specialized Arabic vocabulary
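
The alternating attention pattern and the two RoPE settings can be illustrated with a small sketch. It assumes that layer 0 is a global layer and that every third layer after it is global as well; the page only says "every 3 layers", so the exact offset is an assumption. The helper names `layer_kind` and `rope_inv_freq` are hypothetical and exist only for this illustration.

```python
# Sketch of the layer schedule and RoPE frequencies described above.
NUM_LAYERS = 22
HIDDEN_SIZE = 768
NUM_HEADS = 12
GLOBAL_EVERY = 3          # global attention every 3 layers
LOCAL_WINDOW = 128        # local attention window, in tokens
GLOBAL_THETA = 160_000.0  # RoPE theta for global layers
LOCAL_THETA = 10_000.0    # RoPE theta for local layers


def layer_kind(layer_idx: int) -> str:
    """Return 'global' or 'local' for a layer index.

    ASSUMPTION: layer 0 is global, then every third layer after it.
    """
    return "global" if layer_idx % GLOBAL_EVERY == 0 else "local"


def rope_inv_freq(theta: float, head_dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


head_dim = HIDDEN_SIZE // NUM_HEADS  # dimensions per attention head
schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(schedule) if kind == "global"]

global_freqs = rope_inv_freq(GLOBAL_THETA, head_dim)
local_freqs = rope_inv_freq(LOCAL_THETA, head_dim)

print("global layers:", global_layers)
print("local layers: ", NUM_LAYERS - len(global_layers))
print("head dim:", head_dim, "| rotary pairs:", len(global_freqs))
```

A larger theta lowers the slowest rotation frequencies, so positions far apart remain distinguishable. That fits the global layers, which attend across the whole 8,192-token context, while the local layers only ever compare tokens within a 128-token window.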

Core Capabilities

Text Classification (94.32% accuracy)
Named Entity Recognition (90.39% accuracy)
Semantic Textual Similarity (STS17: 0.831, STS22: 0.617)
Information Retrieval
RAG (Retrieval Augmented Generation)
Document Similarity Analysis
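
The similarity, retrieval, and RAG capabilities listed above all reduce to comparing sentence or document vectors. This page does not say how those vectors are derived from the model's token outputs, so the sketch below assumes mean pooling over non-padding tokens followed by cosine similarity, which is a common choice for encoder models. It runs on toy two-dimensional vectors rather than real model outputs, and all function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scored = [(cosine(query_vec, doc), i) for i, doc in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy example: the third token is padding and must not affect the average.
query = mean_pool([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0])
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
order = rank_documents(query, docs)
print(query, order)
```

In a retrieval or RAG setting the same ranking step would run over vectors produced by the model for each passage, usually with a vector index replacing the linear scan.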

Frequently Asked Questions

Q: What makes this model unique?

AraModernBert pairs the ModernBERT architecture with an Arabic-specific tokenizer whose embeddings are initialized through Trans-tokenization, and it was trained on a large corpus of Arabic text. Its alternating attention mechanism and 8,192-token context window make it well suited to long-form Arabic text.

Q: What are the recommended use cases?

The model is recommended for text classification, named entity recognition, and semantic similarity analysis. It is best suited to Modern Standard Arabic; performance may vary on dialectal Arabic.
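
For a quick first test, a masked-token query is usually the simplest check. The sketch below assumes that the checkpoint is published on the Hugging Face Hub under an id formed from the developer and model name, that it includes a masked-language-modeling head, and that a version of the `transformers` library with ModernBERT support is installed. None of these is confirmed on this page. The `<mask>` placeholder and the function name `top_predictions` are hypothetical conveniences.

```python
# ASSUMPTION: Hub id built from the developer and model name on this page.
MODEL_ID = "NAMAA-Space/AraModernBert-Base-V1.0"


def top_predictions(text_with_mask: str, k: int = 5):
    """Return the k most likely fillers for a '<mask>' placeholder.

    The import is deferred so that defining this function needs neither
    the transformers library nor network access; the first call downloads
    the checkpoint.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Swap the placeholder for whatever mask token the tokenizer defines.
    text = text_with_mask.replace("<mask>", fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]


# Example call (requires network access and the transformers library):
#   for token, score in top_predictions("عاصمة مصر هي <mask>."):
#       print(token, round(score, 3))
```

For classification or named entity recognition the checkpoint would normally be fine-tuned on labeled data first; the masked-token query only confirms that the model and tokenizer load and produce sensible Arabic output.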