SILMA Embedding STS v0.1
| Property | Value |
|---|---|
| Parameter Count | 135M |
| Output Dimensions | 768 |
| Max Sequence Length | 512 tokens |
| License | Apache 2.0 |
| Languages | Arabic, English |
What is silma-embeddding-sts-v0.1?
SILMA Embedding STS is a specialized sentence transformer model designed for generating high-quality semantic embeddings for both Arabic and English text. Built on the foundation of bert-base-arabertv02, this model has been fine-tuned through a two-phase process to excel at semantic textual similarity tasks.
Implementation Details
The model generates 768-dimensional dense vector representations of input text, which are compared using cosine similarity. It was trained in two phases: first on a dataset of 2.25M triplets, then fine-tuned on 30k sentence pairs annotated with similarity scores.
- Base Architecture: bert-base-arabertv02
- Training Framework: Sentence Transformers 3.2.0
- Optimization: Mixed precision training (BF16)
- Evaluation Metrics: Achieved 85.59% Spearman correlation on Arabic STS tasks
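Since the model's 768-dimensional embeddings are compared with cosine similarity, the comparison step can be sketched in plain NumPy. The vectors below are random toy stand-ins for real model outputs, used only to illustrate the metric:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 768-dimensional vectors standing in for model embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + 0.1 * rng.normal(size=768)  # near-duplicate of emb_a
emb_c = rng.normal(size=768)                # unrelated vector

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # near 0.0
```

In practice the embeddings would come from encoding sentences with the model (e.g. via the Sentence Transformers library it was trained with), and a high cosine score indicates semantically similar text.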
Core Capabilities
- Bilingual semantic similarity assessment
- Cross-lingual text comparison
- Semantic search implementation
- Text classification and clustering
- Question-answer matching
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its strong performance in both Arabic and English semantic tasks, achieving particularly impressive results on Arabic STS tasks (85.59% Spearman correlation). It is also optimized for production use, with efficient inference.
Q: What are the recommended use cases?
The model excels in applications requiring semantic understanding such as text similarity comparison, document clustering, semantic search, and intent classification. It's particularly effective for Arabic language processing while maintaining good performance for English content.