sentence-bert-swedish-cased

Maintained By
KBLab

Developer: KBLab (National Library of Sweden)
Model Type: Sentence Transformer
Vector Dimension: 768
Max Sequence Length: 384 tokens (v2.0)
Teacher Model: all-mpnet-base-v2

What is sentence-bert-swedish-cased?

sentence-bert-swedish-cased is a specialized bilingual transformer model designed to create high-quality sentence embeddings for Swedish and English text. Developed by KBLab, it converts sentences and paragraphs into 768-dimensional dense vectors, enabling advanced semantic search, clustering, and similarity analysis. The model employs knowledge distillation techniques, learning from the powerful all-mpnet-base-v2 teacher model while using KB-BERT as the student model.
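A minimal usage sketch with the sentence-transformers package follows. The Hub id `KBLab/sentence-bert-swedish-cased` and the Swedish example sentences are assumptions for illustration, and the `cosine_similarity` helper is a hypothetical utility, not part of the model's API:

```python
# Sketch: encoding Swedish sentences into 768-dimensional vectors and
# comparing them with cosine similarity. The Hub id and example sentences
# below are illustrative assumptions.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def demo():
    # Requires the sentence-transformers package and network access
    # to download the model from the Hugging Face Hub.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
    emb = model.encode(["Mannen åt mat.", "Kvinnan lagade en måltid."])
    print(emb.shape)  # expected: (2, 768)
    print(cosine_similarity(emb[0], emb[1]))
```

The cosine of the angle between two embeddings is the standard similarity measure here: identical meanings score near 1, unrelated sentences near 0.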

Implementation Details

The model was trained on approximately 14.6 million sentences from English-Swedish parallel corpora, including JW300, Europarl, DGT-TM, EMEA, and other sources. It uses mean pooling over token embeddings and performs strongly on Swedish benchmarks, with v2.0 reaching a Pearson correlation of 0.9283 on SweParaphrase.

  • Trained using AdamW optimizer with learning rate 8e-06
  • Implements warmup linear scheduling with 5000 warmup steps
  • Uses mean pooling for sentence embedding generation
  • Supports both sentence-transformers and HuggingFace implementations
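The mean-pooling step above can be sketched in plain NumPy. Real pipelines operate on torch tensors produced by the transformers library; this sketch only illustrates the arithmetic, with padding positions excluded via the attention mask:

```python
# Sketch of attention-mask-weighted mean pooling, the step that turns
# per-token embeddings into a single sentence embedding.
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, hidden) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # (seq_len, 1)
    summed = (np.asarray(token_embeddings, dtype=float) * mask).sum(axis=0)
    count = np.clip(mask.sum(), 1e-9, None)  # avoid division by zero
    return summed / count
```

Masking before averaging matters: without it, padding tokens would drag the sentence vector toward the padding embedding.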

Core Capabilities

  • Semantic similarity assessment between Swedish texts
  • Cross-lingual embedding generation
  • Document clustering and classification
  • Information retrieval, with 67.27% accuracy on the SweFAQ dev set
  • Zero-shot transfer learning capabilities

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Swedish language understanding while retaining cross-lingual capabilities with English. Because it is distilled from one of the strongest available English sentence embedding models, it is particularly effective for Swedish NLP tasks while still performing well on English text.

Q: What are the recommended use cases?

The model excels in semantic search applications, document similarity comparison, clustering of Swedish texts, and FAQ matching systems. It's particularly suitable for applications requiring understanding of semantic relationships between Swedish sentences or paragraphs.
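As an illustration of the FAQ-matching pattern, here is a small retrieval sketch over precomputed embeddings. The toy 2-D vectors stand in for the 768-dimensional vectors that `model.encode` would produce; `best_match` is a hypothetical helper, not part of the library:

```python
# Sketch: FAQ matching by nearest embedding under cosine similarity.
# In practice, query_vec and faq_vecs would be 768-dim model outputs.
import numpy as np

def best_match(query_vec, faq_vecs):
    """Return the index of the FAQ embedding most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    F = np.asarray(faq_vecs, dtype=float)
    q = q / np.linalg.norm(q)                                # unit query
    F = F / np.linalg.norm(F, axis=1, keepdims=True)         # unit rows
    return int(np.argmax(F @ q))                             # cosine argmax
```

A production system would embed the FAQ entries once, store the vectors, and at query time embed only the incoming question before running this comparison.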
