MMLW-Retrieval-RoBERTa-Large

Maintained by: sdadas

Author: sdadas
Model Type: Dense Retrieval Model
Vector Dimensions: 1024
Base Architecture: RoBERTa Large
Paper: PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods

What is mmlw-retrieval-roberta-large?

MMLW-retrieval-roberta-large is a neural text encoder specialized for Polish-language information retrieval. The model transforms both queries and passages into 1024-dimensional vectors, enabling efficient semantic search. It is trained with a two-step procedure, described below.

Implementation Details

The model's development follows a two-phase approach. First, it is trained with multilingual knowledge distillation on 60 million Polish-English text pairs, using English FlagEmbeddings (BGE) as teacher models. Second, it is fine-tuned on Polish MS MARCO data with a contrastive loss. Training ran on a cluster of 12 A100 GPUs, with a batch size of 288 for the large model variant.
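The fine-tuning step can be sketched as an in-batch-negatives contrastive loss, a common formulation for dense retrievers trained on (query, positive passage) pairs. The exact loss used for this model is not spelled out here, so the formulation and the temperature value below are assumptions:

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    q_emb, p_emb: (batch, dim) arrays; row i of p_emb is the positive
    passage for row i of q_emb, and all other rows serve as negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                 # (batch, batch) similarities
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: identical query/passage embeddings give a near-zero loss.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 1024))
loss = in_batch_contrastive_loss(q, q.copy())
```

Pushing each query toward its own passage and away from the other passages in the batch is why the large batch size (288) matters: more in-batch negatives make the task harder and the embeddings more discriminative.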

  • Requires specific query prefix "zapytanie: " for optimal performance
  • Achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark
  • Implements efficient vector encoding for both queries and passages
  • Utilizes state-of-the-art training techniques including knowledge distillation
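Putting the points above together, a minimal retrieval sketch looks like the following. The sentence-transformers loading shown in the comments is an assumed usage (model id inferred from the title), and placeholder vectors stand in for real embeddings so the ranking logic runs standalone:

```python
import numpy as np

# In practice the embeddings would come from the model, e.g. (assumed usage):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")
#   q_emb = model.encode("zapytanie: Jak dożyć 100 lat?")  # note the query prefix
#   p_embs = model.encode(passages)                        # passages need no prefix
#
# Below, placeholder 1024-dimensional vectors stand in for model output.
rng = np.random.default_rng(0)
DIM = 1024

def cosine_scores(query_vec, passage_vecs):
    """Cosine similarity between one query vector and each passage vector."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    return p @ q

q_emb = rng.standard_normal(DIM)
p_embs = rng.standard_normal((3, DIM))

scores = cosine_scores(q_emb, p_embs)   # one similarity score per passage
best = int(np.argmax(scores))           # index of the best-matching passage
```

Forgetting the "zapytanie: " prefix on queries is a common pitfall: the model was trained with it, so omitting it degrades ranking quality.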

Core Capabilities

  • High-quality semantic search in Polish language
  • Efficient text-to-vector transformation
  • Robust performance on information retrieval tasks
  • Optimized for both query and document encoding

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its specialized optimization for Polish language retrieval tasks, combined with its sophisticated training approach using multilingual knowledge distillation and large-scale fine-tuning on Polish-specific datasets.

Q: What are the recommended use cases?

The model is particularly well-suited for information retrieval applications in Polish, including semantic search systems, document retrieval, and question-answering systems. It's specifically designed to handle both query encoding and document matching effectively.
