SILMA Embedding STS v0.1
| Property | Value |
|---|---|
| Parameter Count | 135M |
| Output Dimensions | 768 |
| Max Sequence Length | 512 tokens |
| License | Apache 2.0 |
| Languages | Arabic, English |
What is silma-embeddding-sts-v0.1?
SILMA Embedding STS is a specialized sentence transformer model designed for generating high-quality semantic embeddings for both Arabic and English text. Built on the foundation of bert-base-arabertv02, this model has been fine-tuned through a two-phase process to excel at semantic textual similarity tasks.
Implementation Details
The model generates 768-dimensional dense vector representations of input text, which are compared using cosine similarity. It was trained in two phases: first on a dataset of 2.25M triplets, then fine-tuned on 30k sentence pairs annotated with similarity scores.
- Base Architecture: bert-base-arabertv02
- Training Framework: Sentence Transformers 3.2.0
- Optimization: Mixed precision training (BF16)
- Evaluation Metrics: Achieved 85.59% Spearman correlation on Arabic STS tasks
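Since the model's 768-dimensional embeddings are compared with cosine similarity, the comparison step can be sketched in plain NumPy. The vectors below are random toy stand-ins for real model outputs, used only to illustrate the metric:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 768-dimensional vectors standing in for model embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + 0.1 * rng.normal(size=768)  # near-duplicate of emb_a
emb_c = rng.normal(size=768)                # unrelated vector

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # near 0.0
```

In practice the embeddings would come from encoding sentences with the model (e.g. via the Sentence Transformers library it was trained with), and a high cosine score indicates semantically similar text.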
Core Capabilities
- Bilingual semantic similarity assessment
- Cross-lingual text comparison
- Semantic search implementation
- Text classification and clustering
- Question-answer matching
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its strong performance in both Arabic and English semantic tasks, achieving particularly impressive results on Arabic STS tasks (85.59% Spearman correlation). It is also optimized for production use, with efficient inference.
Q: What are the recommended use cases?
The model excels in applications requiring semantic understanding such as text similarity comparison, document clustering, semantic search, and intent classification. It's particularly effective for Arabic language processing while maintaining good performance for English content.