# ColBERT-XM
| Property | Value |
|---|---|
| Parameter Count | 853M |
| Model Type | Multi-vector Retrieval |
| License | MIT |
| Paper | arXiv:2402.15059 |
| Languages | 81 languages |
## What is ColBERT-XM?
ColBERT-XM is a multilingual semantic search model that uses token-level embeddings for efficient, accurate passage retrieval. Built on the XMOD backbone, it combines a late-interaction architecture with language-specific adapters, enabling zero-shot cross-lingual retrieval.
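The late-interaction scoring used by ColBERT-style models can be sketched in a few lines. The following is a minimal NumPy illustration, not the model's actual code; it uses 128-dimensional token embeddings, matching the implementation details in this card.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction relevance score.

    query_emb:   (num_query_tokens, dim) L2-normalized token embeddings
    passage_emb: (num_passage_tokens, dim) L2-normalized token embeddings
    Each query token is matched to its most similar passage token (MaxSim),
    and the per-token maxima are summed into a single relevance score.
    """
    sim = query_emb @ passage_emb.T       # (q_len, p_len) cosine similarities
    return float(sim.max(axis=1).sum())   # best match per query token, summed

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example with random stand-in embeddings (dim=128, as in ColBERT-XM)
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 128)))    # 4 query tokens
p = normalize(rng.normal(size=(50, 128)))   # 50 passage tokens
score = maxsim_score(q, p)
```

Because each query token independently picks its best-matching passage token, the score rewards fine-grained term overlap rather than compressing each text into one vector.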
## Implementation Details
The model builds on an XMOD backbone (853M parameters in total, most of them in the per-language adapters) and is trained on the MS MARCO passage ranking dataset with 6.4M training triples. Optimization combines a pairwise softmax cross-entropy loss over (positive, hard-negative) pairs with an in-batch sampled softmax cross-entropy loss. The implementation uses 128-dimensional token embeddings, with maximum sequence lengths of 32 tokens for questions and 256 for passages.
- Trained for 50k steps on an 80GB NVIDIA H100 GPU
- AdamW optimizer with 3e-6 peak learning rate
- Batch size of 128 with 10% warmup steps
- Uses hard negatives from 12 distinct dense retrievers
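The two training losses named above can be sketched as follows. This is a toy NumPy version operating on precomputed relevance scores, purely for illustration; it is not the actual training code.

```python
import numpy as np

def pairwise_softmax_ce(s_pos: np.ndarray, s_neg: np.ndarray) -> float:
    """Pairwise softmax cross-entropy over (positive, hard-negative) score
    pairs: -log softmax([s_pos, s_neg])[0], in numerically stable form."""
    return float(np.mean(np.logaddexp(0.0, s_neg - s_pos)))

def in_batch_softmax_ce(scores: np.ndarray) -> float:
    """In-batch sampled softmax cross-entropy: row i's positive passage sits
    in column i; every other passage in the batch acts as a negative."""
    shifted = scores - scores.max(axis=1, keepdims=True)       # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of 8 queries with random stand-in scores
rng = np.random.default_rng(1)
s_pos = rng.normal(loc=2.0, size=8)      # scores for gold passages
s_neg = rng.normal(loc=0.0, size=8)      # scores for hard negatives
batch_scores = rng.normal(size=(8, 8))   # query-vs-passage score matrix
loss = pairwise_softmax_ce(s_pos, s_neg) + in_batch_softmax_ce(batch_scores)
```

The pairwise term pushes each positive above its mined hard negative, while the in-batch term reuses the other passages in the batch as cheap additional negatives.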
## Core Capabilities
- Zero-shot multilingual retrieval across 81 languages
- Competitive performance on the mMARCO and Mr. TyDi benchmarks
- Efficient token-level representation using MaxSim operators
- Strong performance in both high and low-resource languages
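The efficiency of MaxSim-based retrieval comes from the fact that passage token embeddings can be encoded and indexed offline, so query-time work reduces to one similarity matrix per candidate. A minimal sketch, again with random stand-in embeddings rather than real model output:

```python
import numpy as np

def rank_passages(query_emb, passage_embs, top_k=3):
    """Rank pre-encoded candidate passages by MaxSim score.

    Passage token embeddings are assumed to be computed and stored offline;
    only the (short) query needs encoding at search time.
    Returns (passage_index, score) pairs, best first.
    """
    scores = [float((query_emb @ p.T).max(axis=1).sum()) for p in passage_embs]
    order = np.argsort(scores)[::-1]
    return [(int(i), scores[int(i)]) for i in order[:top_k]]

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(7)
query = normalize(rng.normal(size=(4, 128)))
distractor = normalize(rng.normal(size=(60, 128)))
relevant = np.vstack([query, distractor[:20]])  # contains the query tokens verbatim
ranking = rank_passages(query, [distractor, relevant], top_k=2)
```

Here the passage that contains the query tokens verbatim scores the maximum (one per query token) and is ranked first.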
## Frequently Asked Questions
**Q: What makes this model unique?**
ColBERT-XM's distinctive feature is its ability to perform efficient multilingual retrieval using token-level representations while maintaining strong performance across diverse languages without requiring language-specific training data. The model achieves this through its innovative use of the XMOD backbone and language-specific adapters.
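The XMOD adapter-routing idea behind this can be illustrated with a toy layer: shared weights serve every language, and a small per-language bottleneck adapter is selected at run time. The dimensions and language codes below are arbitrary placeholders, not the real architecture.

```python
import numpy as np

class ToyXmodLayer:
    """Toy illustration of XMOD-style routing: one shared layer plus a small
    bottleneck adapter per language. Only the adapter matching the input's
    language is applied, so supporting a language means training its adapter,
    not retraining the shared weights. Purely illustrative."""

    def __init__(self, dim=16, bottleneck=4, languages=("en", "fr", "sw")):
        rng = np.random.default_rng(42)
        self.shared = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.adapters = {
            lang: (rng.normal(size=(dim, bottleneck)) / np.sqrt(dim),
                   rng.normal(size=(bottleneck, dim)) / np.sqrt(bottleneck))
            for lang in languages
        }

    def forward(self, x: np.ndarray, lang: str) -> np.ndarray:
        h = np.tanh(x @ self.shared)        # shared weights, all languages
        down, up = self.adapters[lang]      # language-specific adapter
        return h + np.tanh(h @ down) @ up   # residual bottleneck adapter

layer = ToyXmodLayer()
x = np.ones((3, 16))
out_en = layer.forward(x, "en")   # same input routed through different
out_fr = layer.forward(x, "fr")   # language adapters gives different outputs
```

Because the retrieval head only ever sees the shared representation space, a head trained on English data can be paired with any language's adapter, which is the mechanism behind the zero-shot transfer described above.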
**Q: What are the recommended use cases?**
The model is ideal for multilingual information retrieval systems, cross-lingual search applications, and digital libraries requiring semantic search capabilities across multiple languages. It's particularly valuable for organizations needing to handle queries in multiple languages without maintaining separate models for each language.