# ColBERT-XM
| Property | Value |
|---|---|
| Parameter Count | 853M |
| Model Type | Multi-vector Retrieval |
| License | MIT |
| Paper | arXiv:2402.15059 |
| Languages | 81 languages |
## What is ColBERT-XM?
ColBERT-XM is a multilingual semantic search model that uses token-level embeddings for efficient, accurate passage retrieval. Built on the XMOD backbone, it combines a late-interaction architecture with language-specific adapters, enabling zero-shot cross-lingual retrieval.
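The late-interaction scoring used by ColBERT-style models can be sketched in a few lines. The following is a minimal NumPy illustration, not the model's actual code; it uses 128-dimensional token embeddings, matching the implementation details in this card.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction relevance score.

    query_emb:   (num_query_tokens, dim) L2-normalized token embeddings
    passage_emb: (num_passage_tokens, dim) L2-normalized token embeddings
    Each query token is matched to its most similar passage token (MaxSim),
    and the per-token maxima are summed into a single relevance score.
    """
    sim = query_emb @ passage_emb.T       # (q_len, p_len) cosine similarities
    return float(sim.max(axis=1).sum())   # best match per query token, summed

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example with random stand-in embeddings (dim=128, as in ColBERT-XM)
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 128)))    # 4 query tokens
p = normalize(rng.normal(size=(50, 128)))   # 50 passage tokens
score = maxsim_score(q, p)
```

Because each query token independently picks its best-matching passage token, the score rewards fine-grained term overlap rather than compressing each text into one vector.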
## Implementation Details
The model builds on an XMOD backbone (853M parameters in total, most of them in the per-language adapters) and is trained on the MS MARCO passage ranking dataset with 6.4M training triples. Optimization combines a pairwise softmax cross-entropy loss over (positive, hard-negative) pairs with an in-batch sampled softmax cross-entropy loss. The implementation uses 128-dimensional token embeddings, with maximum sequence lengths of 32 tokens for questions and 256 for passages.
- Trained for 50k steps on an 80GB NVIDIA H100 GPU
- AdamW optimizer with 3e-6 peak learning rate
- Batch size of 128 with 10% warmup steps
- Uses hard negatives from 12 distinct dense retrievers
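The two training losses named above can be sketched as follows. This is a toy NumPy version operating on precomputed relevance scores, purely for illustration; it is not the actual training code.

```python
import numpy as np

def pairwise_softmax_ce(s_pos: np.ndarray, s_neg: np.ndarray) -> float:
    """Pairwise softmax cross-entropy over (positive, hard-negative) score
    pairs: -log softmax([s_pos, s_neg])[0], in numerically stable form."""
    return float(np.mean(np.logaddexp(0.0, s_neg - s_pos)))

def in_batch_softmax_ce(scores: np.ndarray) -> float:
    """In-batch sampled softmax cross-entropy: row i's positive passage sits
    in column i; every other passage in the batch acts as a negative."""
    shifted = scores - scores.max(axis=1, keepdims=True)       # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of 8 queries with random stand-in scores
rng = np.random.default_rng(1)
s_pos = rng.normal(loc=2.0, size=8)      # scores for gold passages
s_neg = rng.normal(loc=0.0, size=8)      # scores for hard negatives
batch_scores = rng.normal(size=(8, 8))   # query-vs-passage score matrix
loss = pairwise_softmax_ce(s_pos, s_neg) + in_batch_softmax_ce(batch_scores)
```

The pairwise term pushes each positive above its mined hard negative, while the in-batch term reuses the other passages in the batch as cheap additional negatives.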
## Core Capabilities
- Zero-shot multilingual retrieval across 81 languages
- Competitive performance on the mMARCO and Mr. TyDi benchmarks
- Efficient token-level representation using MaxSim operators
- Strong performance in both high and low-resource languages
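The efficiency of MaxSim-based retrieval comes from the fact that passage token embeddings can be encoded and indexed offline, so query-time work reduces to one similarity matrix per candidate. A minimal sketch, again with random stand-in embeddings rather than real model output:

```python
import numpy as np

def rank_passages(query_emb, passage_embs, top_k=3):
    """Rank pre-encoded candidate passages by MaxSim score.

    Passage token embeddings are assumed to be computed and stored offline;
    only the (short) query needs encoding at search time.
    Returns (passage_index, score) pairs, best first.
    """
    scores = [float((query_emb @ p.T).max(axis=1).sum()) for p in passage_embs]
    order = np.argsort(scores)[::-1]
    return [(int(i), scores[int(i)]) for i in order[:top_k]]

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(7)
query = normalize(rng.normal(size=(4, 128)))
distractor = normalize(rng.normal(size=(60, 128)))
relevant = np.vstack([query, distractor[:20]])  # contains the query tokens verbatim
ranking = rank_passages(query, [distractor, relevant], top_k=2)
```

Here the passage that contains the query tokens verbatim scores the maximum (one per query token) and is ranked first.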
## Frequently Asked Questions
**Q: What makes this model unique?**
ColBERT-XM's distinctive feature is its ability to perform efficient multilingual retrieval using token-level representations while maintaining strong performance across diverse languages without requiring language-specific training data. The model achieves this through its innovative use of the XMOD backbone and language-specific adapters.
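The XMOD adapter-routing idea behind this can be illustrated with a toy layer: shared weights serve every language, and a small per-language bottleneck adapter is selected at run time. The dimensions and language codes below are arbitrary placeholders, not the real architecture.

```python
import numpy as np

class ToyXmodLayer:
    """Toy illustration of XMOD-style routing: one shared layer plus a small
    bottleneck adapter per language. Only the adapter matching the input's
    language is applied, so supporting a language means training its adapter,
    not retraining the shared weights. Purely illustrative."""

    def __init__(self, dim=16, bottleneck=4, languages=("en", "fr", "sw")):
        rng = np.random.default_rng(42)
        self.shared = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.adapters = {
            lang: (rng.normal(size=(dim, bottleneck)) / np.sqrt(dim),
                   rng.normal(size=(bottleneck, dim)) / np.sqrt(bottleneck))
            for lang in languages
        }

    def forward(self, x: np.ndarray, lang: str) -> np.ndarray:
        h = np.tanh(x @ self.shared)        # shared weights, all languages
        down, up = self.adapters[lang]      # language-specific adapter
        return h + np.tanh(h @ down) @ up   # residual bottleneck adapter

layer = ToyXmodLayer()
x = np.ones((3, 16))
out_en = layer.forward(x, "en")   # same input routed through different
out_fr = layer.forward(x, "fr")   # language adapters gives different outputs
```

Because the retrieval head only ever sees the shared representation space, a head trained on English data can be paired with any language's adapter, which is the mechanism behind the zero-shot transfer described above.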
**Q: What are the recommended use cases?**
The model is ideal for multilingual information retrieval systems, cross-lingual search applications, and digital libraries requiring semantic search capabilities across multiple languages. It's particularly valuable for organizations needing to handle queries in multiple languages without maintaining separate models for each language.