MMLW-Retrieval-RoBERTa-Large

Maintained by: sdadas

Author: sdadas
Model Type: Dense Retrieval Model
Vector Dimensions: 1024
Base Architecture: RoBERTa Large
Paper: PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods

What is mmlw-retrieval-roberta-large?

MMLW-retrieval-roberta-large is a neural text encoder specialized for Polish-language information retrieval. The model transforms both queries and passages into 1024-dimensional vectors, enabling efficient semantic search. It is trained with a two-step procedure, described below.

Implementation Details

The model's development follows a two-phase approach. First, it is trained with multilingual knowledge distillation on 60 million Polish-English text pairs, using English FlagEmbeddings (BGE) as teacher models. Second, it is fine-tuned on Polish MS MARCO data with a contrastive loss. Training ran on a cluster of 12 A100 GPUs, with a batch size of 288 for the large model variant.
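The fine-tuning step can be sketched as an in-batch-negatives contrastive loss, a common formulation for dense retrievers trained on (query, positive passage) pairs. The exact loss used for this model is not spelled out here, so the formulation and the temperature value below are assumptions:

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    q_emb, p_emb: (batch, dim) arrays; row i of p_emb is the positive
    passage for row i of q_emb, and all other rows serve as negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                 # (batch, batch) similarities
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: identical query/passage embeddings give a near-zero loss.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 1024))
loss = in_batch_contrastive_loss(q, q.copy())
```

Pushing each query toward its own passage and away from the other passages in the batch is why the large batch size (288) matters: more in-batch negatives make the task harder and the embeddings more discriminative.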

  • Requires specific query prefix "zapytanie: " for optimal performance
  • Achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark
  • Implements efficient vector encoding for both queries and passages
  • Utilizes state-of-the-art training techniques including knowledge distillation
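Putting the points above together, a minimal retrieval sketch looks like the following. The sentence-transformers loading shown in the comments is an assumed usage (model id inferred from the title), and placeholder vectors stand in for real embeddings so the ranking logic runs standalone:

```python
import numpy as np

# In practice the embeddings would come from the model, e.g. (assumed usage):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")
#   q_emb = model.encode("zapytanie: Jak dożyć 100 lat?")  # note the query prefix
#   p_embs = model.encode(passages)                        # passages need no prefix
#
# Below, placeholder 1024-dimensional vectors stand in for model output.
rng = np.random.default_rng(0)
DIM = 1024

def cosine_scores(query_vec, passage_vecs):
    """Cosine similarity between one query vector and each passage vector."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    return p @ q

q_emb = rng.standard_normal(DIM)
p_embs = rng.standard_normal((3, DIM))

scores = cosine_scores(q_emb, p_embs)   # one similarity score per passage
best = int(np.argmax(scores))           # index of the best-matching passage
```

Forgetting the "zapytanie: " prefix on queries is a common pitfall: the model was trained with it, so omitting it degrades ranking quality.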

Core Capabilities

  • High-quality semantic search in Polish language
  • Efficient text-to-vector transformation
  • Robust performance on information retrieval tasks
  • Optimized for both query and document encoding

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its specialized optimization for Polish language retrieval tasks, combined with its sophisticated training approach using multilingual knowledge distillation and large-scale fine-tuning on Polish-specific datasets.

Q: What are the recommended use cases?

The model is particularly well-suited for information retrieval applications in Polish, including semantic search systems, document retrieval, and question-answering systems. It's specifically designed to handle both query encoding and document matching effectively.
