simlm-msmarco-reranker

intfloat

A passage reranker released with SimLM, a pre-training method for dense passage retrieval built around a bottleneck architecture. It reports 43.8 MRR@10 on the MS-MARCO passage ranking dev set.

Property	Value
License	MIT
Paper	SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
Language	English
Framework	PyTorch

What is simlm-msmarco-reranker?

SimLM is a pre-training method for dense passage retrieval that uses a bottleneck architecture to compress passage information into a dense vector. simlm-msmarco-reranker is the reranking model released alongside it by Microsoft researchers: it scores query-passage pairs and reaches a dev MRR@10 of 43.8 on the MS-MARCO passage ranking task. The SimLM work also reports outperforming more complex multi-vector approaches.

Implementation Details

SimLM pre-training uses a replaced language modeling objective inspired by ELECTRA, which improves sample efficiency and reduces the distribution mismatch between pre-training and fine-tuning. The reranker is used through the Transformers library: it takes query-passage pairs, with an optional passage title, and outputs one relevance score per pair. It was trained with a listwise loss, so the scores serve mainly as a ranking signal, where a higher score means a more relevant passage.

Maximum sequence length of 192 tokens
Supports batch processing for efficient inference
Implements sequence classification architecture
Uses ELECTRA-based architecture for better efficiency
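
The points above can be sketched as a small scoring helper. This is a minimal sketch, not the authors' reference code: it assumes the checkpoint is available on the Hugging Face Hub as `intfloat/simlm-msmarco-reranker`, that a passage is formatted as `title: passage` with `-` standing in for a missing title, and that the model emits a single relevance logit per pair. The names `format_passage` and `score_pairs` are illustrative and do not belong to any library.

```python
from typing import List, Optional, Tuple


def format_passage(passage: str, title: Optional[str] = None) -> str:
    """Prefix the passage with its title ('-' is an assumed placeholder for no title)."""
    return f"{title or '-'}: {passage}"


def score_pairs(pairs: List[Tuple[str, str]],
                model_name: str = "intfloat/simlm-msmarco-reranker",
                max_length: int = 192) -> List[float]:
    """Return one relevance score per (query, passage) pair; higher means more relevant."""
    # Imported lazily so format_passage stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    queries = [query for query, _ in pairs]
    passages = [format_passage(passage) for _, passage in pairs]
    # Encode each query together with its passage, truncated to 192 tokens.
    batch = tokenizer(queries, text_pair=passages, max_length=max_length,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # Assumption: one relevance logit per pair, so take column 0.
    return logits[:, 0].tolist()
```

Calling `score_pairs([("how long is super bowl game", "The Super Bowl is typically four hours long.")])` would download the checkpoint on first use and return a one-element list of scores. For larger candidate sets, loading the tokenizer and model once outside the function would avoid repeated initialization.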

Core Capabilities

Relevance scoring of query-passage pairs in dense retrieval pipelines
Effective document reranking for search applications
High recall rates (98.6% R@1k on MS-MARCO dev set)
Strong performance on TREC DL tasks (74.6 nDCG@10 on TREC DL 2019)
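
For readers unfamiliar with the metrics quoted above, here is a small sketch of how MRR@10 and recall@k are commonly computed from ranked results. The function names and toy data are illustrative only and are not taken from the official MS-MARCO evaluation scripts.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int = 1000) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if gold:
            total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(rankings)


# Toy data: q1 has its relevant passage at rank 2, q2 has none retrieved.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9"]}
relevant = {"q1": {"p1"}, "q2": {"p4"}}
print(mrr_at_k(rankings, relevant, k=10))    # 0.25
print(recall_at_k(rankings, relevant, k=2))  # 0.5
```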

Frequently Asked Questions

Q: What makes this model unique?

SimLM combines a bottleneck architecture with self-supervised pre-training that needs only an unlabeled corpus, with no labeled data or queries. The SimLM work reports better results than multi-vector approaches such as ColBERTv2 while avoiding their larger storage cost.

Q: What are the recommended use cases?

The model is ideal for information retrieval systems, search engine reranking, and document retrieval applications where high-quality passage ranking is required. It's particularly effective for scenarios requiring strong performance without extensive labeled data.
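
The reranking use case can be sketched as a second stage over candidates returned by a first-stage retriever. This sketch assumes a hypothetical `score_fn` callable that returns one relevance score per (query, passage) pair; in practice it would wrap the model, and the word-overlap scorer below is only a stand-in so the example runs without any downloads.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str],
           score_fn: ScoreFn, top_n: int = 10) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return the top_n, best first."""
    if not candidates:
        return []
    scores = score_fn([(query, candidate) for candidate in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer: counts whitespace-separated words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split())))
            for q, p in pairs]


candidates = [
    "Paris is the capital of France.",
    "The Super Bowl is typically four hours long.",
    "A football field is 100 yards long.",
]
top = rerank("how long is super bowl game", candidates, overlap_score, top_n=2)
for passage, score in top:
    print(f"{score:.1f}  {passage}")
```

Keeping the scorer behind a plain callable means the first-stage retriever, the reranking model, and the cutoff `top_n` can each be swapped without touching the others.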