silma-embeddding-sts-v0.1
| Property | Value |
|---|---|
| Parameter Count | 135M |
| Model Type | Sentence Transformer |
| Architecture | BERT-based (arabertv02) |
| License | Apache 2.0 |
| Languages | Arabic, English |
| Output Dimension | 768 |
What is silma-embeddding-sts-v0.1?
silma-embeddding-sts-v0.1 is a specialized bilingual sentence transformer model designed for semantic textual similarity tasks in Arabic and English. Built on the arabertv02 architecture, this model maps sentences and paragraphs to a 768-dimensional dense vector space, enabling various NLP tasks like semantic search, paraphrase detection, and text classification.
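A minimal usage sketch with the sentence-transformers library is shown below; the Hub repo id `silma-ai/silma-embeddding-sts-v0.1` and the example sentences are illustrative assumptions, not part of the model card.

```python
# Minimal sketch: the Hub repo id below is an assumption; replace it with
# the id you actually use to download the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

sentences = [
    "القاهرة هي عاصمة مصر",          # "Cairo is the capital of Egypt"
    "Cairo is the capital of Egypt",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```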
Implementation Details
The model underwent a two-phase training process: it was first fine-tuned on 2.25M Arabic/English triplets, then further refined on 30k sentence pairs annotated with similarity scores. On the STS17 Arabic benchmark it scores 85.6%.
- Maximum sequence length: 512 tokens
- Similarity measure: Cosine similarity (see the sketch after this list)
- Training framework: Sentence-Transformers 3.2.0
- Hardware optimization: BF16 precision support
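The snippet below is a sketch of the cosine-similarity comparison described above, scoring an Arabic/English sentence pair; the repo id and example texts are assumptions, and `model.similarity()` requires Sentence-Transformers 3.x (the model card lists 3.2.0).

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: the repo id and sentences are assumptions.
model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

arabic = model.encode("أين يقع برج إيفل؟", convert_to_tensor=True)   # "Where is the Eiffel Tower?"
english = model.encode("The Eiffel Tower is located in Paris.", convert_to_tensor=True)

# model.similarity() applies the model's configured score function
# (cosine similarity here) and returns a 1x1 score matrix.
score = model.similarity(arabic, english)
print(float(score))  # in [-1, 1]; higher means more semantically similar
```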
Core Capabilities
- Cross-lingual semantic similarity between Arabic and English texts
- Short and long sentence comparison
- Question-to-paragraph matching
- Intent classification and mapping
- Semantic search functionality (see the sketch below)
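For question-to-paragraph matching and semantic search, a minimal sketch using `sentence_transformers.util.semantic_search` might look like the following; the repo id, query, and passages are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical query and corpus; the repo id is an assumption.
model = SentenceTransformer("silma-ai/silma-embeddding-sts-v0.1")

query = "ما هي عاصمة فرنسا؟"  # "What is the capital of France?"
passages = [
    "Paris is the capital and most populous city of France.",
    "The Nile is the longest river in Africa.",
    "تقع باريس على نهر السين في شمال فرنسا.",  # "Paris lies on the Seine in northern France."
]

query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)

# Rank passages by cosine similarity to the query and print the top hits.
hits = util.semantic_search(query_emb, passage_embs, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {passages[hit['corpus_id']]}")
```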
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its strong performance on Arabic-language tasks while retaining English capability, scoring 85.6% on the STS17 Arabic benchmark. Its two-phase training approach supports robust cross-lingual understanding.
Q: What are the recommended use cases?
The model excels in bilingual applications including semantic search, content similarity matching, and intent classification. It's particularly suitable for Arabic-English cross-lingual applications and Arabic-specific NLP tasks.