MMLW-Retrieval-RoBERTa-Large
| Property | Value |
|---|---|
| Author | sdadas |
| Model Type | Dense Retrieval Model |
| Vector Dimensions | 1024 |
| Base Architecture | RoBERTa Large |
| Paper | PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods |
What is mmlw-retrieval-roberta-large?
MMLW-retrieval-roberta-large is a neural text encoder built specifically for Polish-language information retrieval. The model maps both queries and passages into a shared 1024-dimensional vector space, so relevance can be scored by vector similarity, enabling efficient semantic search. Its quality on Polish retrieval tasks comes from a two-step training procedure described below.
Implementation Details
The model was trained in two phases. First, multilingual knowledge distillation on 60 million Polish-English text pairs, with English FlagEmbeddings (BGE) models serving as teachers. Second, fine-tuning on Polish MS MARCO data with a contrastive loss. Training ran on a cluster of 12 A100 GPUs, using a batch size of 288 for the large model variant.
- Requires specific query prefix "zapytanie: " for optimal performance
- Achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark
- Implements efficient vector encoding for both queries and passages
- Utilizes state-of-the-art training techniques including knowledge distillation
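The retrieval flow implied by the points above can be sketched without the model itself: queries get the "zapytanie: " prefix (passages do not), both sides are encoded into 1024-dimensional vectors, and passages are ranked by cosine similarity. The sketch below uses small toy vectors as stand-ins for the encoder's output; the helper names are illustrative, not part of the model's API.

```python
import math

QUERY_PREFIX = "zapytanie: "  # passages are encoded without any prefix

def prefix_query(query: str) -> str:
    """Prepend the prefix the model expects on queries."""
    return QUERY_PREFIX + query

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In practice the 1024-dimensional vectors come from the encoder;
# toy 4-dimensional stand-ins illustrate only the ranking step.
query_vec = [0.9, 0.1, 0.0, 0.2]
passages = {
    "passage A": [0.8, 0.2, 0.1, 0.1],
    "passage B": [0.0, 0.9, 0.1, 0.8],
}
ranked = sorted(passages, key=lambda p: cos_sim(query_vec, passages[p]), reverse=True)
print(ranked[0])  # the passage most similar to the query
```

The same ranking logic applies unchanged once the toy vectors are replaced by real encoder outputs.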
Core Capabilities
- High-quality semantic search in Polish language
- Efficient text-to-vector transformation
- Robust performance on information retrieval tasks
- Optimized for both query and document encoding
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its specialized optimization for Polish language retrieval tasks, combined with its sophisticated training approach using multilingual knowledge distillation and large-scale fine-tuning on Polish-specific datasets.
Q: What are the recommended use cases?
The model is particularly well-suited for information retrieval applications in Polish, including semantic search systems, document retrieval, and question-answering systems. It's specifically designed to handle both query encoding and document matching effectively.
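For the use cases above, a minimal encoding helper might look like the following. It assumes the checkpoint is published on the Hugging Face hub under `sdadas/mmlw-retrieval-roberta-large` and is loadable with the `sentence-transformers` library; both are assumptions to verify against the actual model page. The heavy import is deferred so the prefixing helper works on its own.

```python
from typing import Iterable, List

MODEL_ID = "sdadas/mmlw-retrieval-roberta-large"  # assumed hub identifier

def prepare_queries(queries: Iterable[str]) -> List[str]:
    """Apply the required 'zapytanie: ' prefix to every query."""
    return ["zapytanie: " + q for q in queries]

def encode(queries: Iterable[str], passages: Iterable[str]):
    """Encode queries and passages into 1024-dimensional vectors.

    Deferred import keeps prepare_queries usable without the package installed.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(MODEL_ID)
    return model.encode(prepare_queries(queries)), model.encode(list(passages))
```

A typical search system would call `encode` once for the document collection, store the passage vectors in an ANN index, and encode only the incoming query at request time.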