silma-embeddding-sts-v0.1
| Property | Value |
|---|---|
| Parameter Count | 135M |
| Model Type | Sentence Transformer |
| Architecture | BERT-based (arabertv02) |
| License | Apache 2.0 |
| Languages | Arabic, English |
| Output Dimension | 768 |
What is silma-embeddding-sts-v0.1?
silma-embeddding-sts-v0.1 is a specialized bilingual sentence transformer model designed for semantic textual similarity tasks in Arabic and English. Built on the arabertv02 architecture, this model maps sentences and paragraphs to a 768-dimensional dense vector space, enabling various NLP tasks like semantic search, paraphrase detection, and text classification.
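A minimal usage sketch with the sentence-transformers library is shown below; the Hub repo id `silma-ai/silma-embeddding-sts-v0.1` and the example sentences are illustrative assumptions, not part of the model card.

```python
# Minimal sketch: the Hub repo id below is an assumption; replace it with
# the id you actually use to download the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

sentences = [
    "القاهرة هي عاصمة مصر",          # "Cairo is the capital of Egypt"
    "Cairo is the capital of Egypt",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```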
Implementation Details
The model underwent a two-phase training process: it was first fine-tuned on 2.25M Arabic/English triplets, then further refined on 30k sentence pairs annotated with similarity scores. On the STS17 Arabic benchmark it scores 85.6%.
- Maximum sequence length: 512 tokens
- Similarity measure: Cosine similarity (see the sketch after this list)
- Training framework: Sentence-Transformers 3.2.0
- Hardware optimization: BF16 precision support
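The snippet below is a sketch of the cosine-similarity comparison described above, scoring an Arabic/English sentence pair; the repo id and example texts are assumptions, and `model.similarity()` requires Sentence-Transformers 3.x (the model card lists 3.2.0).

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: the repo id and sentences are assumptions.
model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

arabic = model.encode("أين يقع برج إيفل؟", convert_to_tensor=True)   # "Where is the Eiffel Tower?"
english = model.encode("The Eiffel Tower is located in Paris.", convert_to_tensor=True)

# model.similarity() applies the model's configured score function
# (cosine similarity here) and returns a 1x1 score matrix.
score = model.similarity(arabic, english)
print(float(score))  # in [-1, 1]; higher means more semantically similar
```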
Core Capabilities
- Cross-lingual semantic similarity between Arabic and English texts
- Short and long sentence comparison
- Question-to-paragraph matching
- Intent classification and mapping
- Semantic search functionality (see the sketch below)
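For question-to-paragraph matching and semantic search, a minimal sketch using `sentence_transformers.util.semantic_search` might look like the following; the repo id, query, and passages are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical query and corpus; the repo id is an assumption.
model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

query = "ما هي عاصمة فرنسا؟"  # "What is the capital of France?"
passages = [
    "Paris is the capital and most populous city of France.",
    "The Nile is the longest river in Africa.",
    "تقع باريس على نهر السين في شمال فرنسا.",  # "Paris lies on the Seine in northern France."
]

query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)

# Rank passages by cosine similarity to the query and print the top hits.
hits = util.semantic_search(query_emb, passage_embs, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {passages[hit['corpus_id']]}")
```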
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its strong performance on Arabic-language tasks while retaining English capability, scoring 85.6% on the STS17 Arabic benchmark. Its two-phase training approach supports robust cross-lingual understanding.
Q: What are the recommended use cases?
The model excels in bilingual applications including semantic search, content similarity matching, and intent classification. It's particularly suitable for Arabic-English cross-lingual applications and Arabic-specific NLP tasks.