mmlw-retrieval-roberta-large

by sdadas

Polish RoBERTa-based retrieval model optimized for semantic search, featuring 1024-dimensional vectors and trained via knowledge distillation from English models.

| Property | Value |
|---|---|
| Author | sdadas |
| Model Type | Dense Retrieval Model |
| Vector Dimensions | 1024 |
| Base Architecture | RoBERTa Large |
| Paper | PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods |

What is mmlw-retrieval-roberta-large?

MMLW-retrieval-roberta-large is a neural text encoder designed for Polish-language information retrieval. The model maps both queries and passages into 1024-dimensional vectors, enabling efficient semantic search. Its quality rests on a two-step training procedure: multilingual knowledge distillation followed by supervised fine-tuning.

Implementation Details

The model was developed in two phases. First, it was trained via multilingual knowledge distillation on 60 million Polish-English text pairs, with English FlagEmbeddings (BGE) models serving as teachers. Second, it was fine-tuned on Polish MS MARCO data using a contrastive loss. Training ran on a cluster of 12 A100 GPUs, with a batch size of 288 for the large model variant.

  • Requires specific query prefix "zapytanie: " for optimal performance
  • Achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark
  • Implements efficient vector encoding for both queries and passages
  • Utilizes state-of-the-art training techniques including knowledge distillation
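The query prefix requirement above is the main usage pitfall: queries must start with "zapytanie: ", while passages are encoded as-is. A minimal sketch of that convention, written against a sentence-transformers-style `encode()` interface (the commented loading line assumes the Hugging Face model id `sdadas/mmlw-retrieval-roberta-large`):

```python
# Encoding convention for mmlw-retrieval-roberta-large:
# queries need the "zapytanie: " prefix, passages do not.

QUERY_PREFIX = "zapytanie: "


def prepare_queries(queries):
    """Prepend the required Polish query prefix to each query string."""
    return [QUERY_PREFIX + q for q in queries]


def encode(model, queries, passages):
    """Encode queries (with prefix) and passages into dense vectors.

    `model` is assumed to expose a sentence-transformers-style
    encode(list_of_str) method returning one vector per input text;
    for this model each vector is 1024-dimensional.
    """
    query_vecs = model.encode(prepare_queries(queries))
    passage_vecs = model.encode(passages)
    return query_vecs, passage_vecs


# Typical loading (assumption: the public Hugging Face checkpoint):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")
# q_vecs, p_vecs = encode(model, ["jak działa fotosynteza?"], passages)
```

Keeping the prefix in one helper avoids the silent quality drop that occurs when raw queries are fed to the encoder.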

Core Capabilities

  • High-quality semantic search in Polish language
  • Efficient text-to-vector transformation
  • Robust performance on information retrieval tasks
  • Optimized for both query and document encoding
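Once queries and passages are embedded, semantic search reduces to nearest-neighbor ranking over the vectors. A minimal NumPy sketch of cosine-similarity ranking (the function name and toy vectors are illustrative, not part of the model's API):

```python
import numpy as np


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Rank passages by cosine similarity to a query vector.

    query_vec: 1-D array (e.g. 1024 dims for this model).
    passage_vecs: 2-D array, one row per passage.
    Returns a list of (passage_index, score) pairs, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q  # cosine similarity of each passage to the query
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

For production-scale corpora, the same vectors can be served from an approximate nearest-neighbor index (e.g. FAISS) instead of brute-force scoring.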

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its specialized optimization for Polish language retrieval tasks, combined with its sophisticated training approach using multilingual knowledge distillation and large-scale fine-tuning on Polish-specific datasets.

Q: What are the recommended use cases?

The model is particularly well-suited for information retrieval applications in Polish, including semantic search systems, document retrieval, and question-answering systems. It's specifically designed to handle both query encoding and document matching effectively.
