MMLW-RoBERTa-Base
| Property | Value |
|---|---|
| Author | sdadas |
| Architecture | RoBERTa-Base |
| Output Dimensions | 768 |
| MTEB Score | 61.05 |
| Paper | arXiv:2402.13350 |
What is mmlw-roberta-base?
MMLW-RoBERTa-Base is a neural text encoder built specifically for Polish. It was developed through multilingual knowledge distillation, training on 60 million Polish-English text pairs with English FlagEmbeddings (BGE) models serving as teachers, and it generates 768-dimensional vectors that capture semantic relationships in text.
Implementation Details
The model requires specific prefixes for optimal performance, most notably "zapytanie: " for queries. It is built on a Polish RoBERTa checkpoint, optimized through knowledge distillation, and exposed via the sentence-transformers framework for easy integration into existing workflows.
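A minimal usage sketch with sentence-transformers; the Hub identifier `sdadas/mmlw-roberta-base` and the example sentences are assumptions for illustration, and passages are assumed to be encoded without a prefix:

```python
from sentence_transformers import SentenceTransformer

# Hub identifier assumed from the author/model names; adjust if it differs.
model = SentenceTransformer("sdadas/mmlw-roberta-base")

# Queries take the "zapytanie: " prefix; passages are encoded as-is.
query = "zapytanie: Jak dbać o zdrowie?"            # "How to take care of one's health?"
passage = "Regularne ćwiczenia wspierają zdrowie."  # "Regular exercise supports health."

embeddings = model.encode([query, passage])
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per input
```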
Key benchmark results:
- Achieves an average score of 61.05 on the Polish Massive Text Embedding Benchmark (MTEB)
- Reaches an NDCG@10 of 53.60 on the Polish Information Retrieval Benchmark
- Supports semantic similarity, clustering, and information retrieval tasks
Core Capabilities
- Text embedding generation for Polish language
- Semantic similarity computation (see the retrieval sketch after this list)
- Document clustering
- Information retrieval optimization
- Fine-tuning foundation for downstream tasks
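The similarity and retrieval capabilities combine into a simple ranking sketch. Cosine similarity via `sentence_transformers.util.cos_sim` is a standard scoring choice here, not the only option, and the query and passages below are invented for illustration:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sdadas/mmlw-roberta-base")  # assumed Hub id

# Rank candidate passages for a prefixed query by cosine similarity.
query_emb = model.encode(
    ["zapytanie: Kiedy wybuchła druga wojna światowa?"],  # "When did WWII break out?"
    convert_to_tensor=True,
)
passages = [
    "Druga wojna światowa rozpoczęła się 1 września 1939 roku.",
    "Stolicą Polski jest Warszawa.",
]
passage_emb = model.encode(passages, convert_to_tensor=True)

scores = cos_sim(query_emb, passage_emb)  # shape (1, 2)
best = scores.argmax().item()
print(passages[best], scores[0, best].item())  # expected: the 1939 passage ranks first
```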
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing feature is multilingual knowledge distillation: the model was trained specifically for Polish while transferring knowledge from strong English teacher embeddings, which makes it particularly effective for Polish text embedding tasks.
Q: What are the recommended use cases?
The model is well suited to tasks requiring semantic understanding of Polish text, including document similarity comparison, information retrieval, and text clustering, and it serves as a foundation for fine-tuning on specific Polish language tasks.
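For the clustering use case, the embeddings can be fed to any off-the-shelf algorithm. Below is a toy sketch with scikit-learn's KMeans on invented documents, with the Hub identifier assumed as before:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("sdadas/mmlw-roberta-base")  # assumed Hub id

documents = [
    "Reprezentacja Polski wygrała mecz.",          # sports
    "Piłkarze trenują przed turniejem.",           # sports
    "Bank centralny podniósł stopy procentowe.",   # finance
    "Inflacja w Polsce spadła w maju.",            # finance
]
embeddings = model.encode(documents)  # numpy array, shape (4, 768)

# Group documents by embedding proximity; k=2 fits this toy corpus.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for doc, label in zip(documents, labels):
    print(label, doc)
```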