
Maintained by: antoinelouis

ColBERT-XM

Parameter Count: 853M
Model Type: Multi-vector Retrieval
License: MIT
Paper: arXiv:2402.15059
Languages: 81 languages

What is ColBERT-XM?

ColBERT-XM is a multilingual semantic search model that relies on token-level embeddings for efficient and accurate passage retrieval. Built on the XMOD backbone, it combines a late-interaction architecture with language-specific adapters to enable zero-shot cross-lingual retrieval.
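For intuition, the late-interaction (MaxSim) scoring step can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual API: the random tensors below stand in for the token embeddings ColBERT-XM produces, and each query token is matched against its most similar passage token before the per-token maxima are summed into a relevance score.

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score between one query and one passage.

    q_emb: (num_query_tokens, dim)   L2-normalized query token embeddings
    p_emb: (num_passage_tokens, dim) L2-normalized passage token embeddings
    """
    # Cosine similarity between every query token and every passage token.
    sim = q_emb @ p_emb.T                  # (num_query_tokens, num_passage_tokens)
    # Each query token keeps only its best-matching passage token ...
    per_token_max = sim.max(dim=1).values  # (num_query_tokens,)
    # ... and the per-token maxima are summed into a single relevance score.
    return per_token_max.sum()

# Illustrative shapes only: 32 query tokens, 256 passage tokens, 128-dim embeddings.
query = F.normalize(torch.randn(32, 128), dim=-1)
passage = F.normalize(torch.randn(256, 128), dim=-1)
print(maxsim_score(query, passage))
```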

Implementation Details

The model employs a 277M-parameter architecture trained on the MS MARCO passage ranking dataset with 6.4M training triples. It is optimized with a combination of pairwise softmax cross-entropy loss and in-batch sampled softmax cross-entropy loss (a sketch of this combination follows the list below). The implementation uses 128-dimensional embeddings with maximum sequence lengths of 32 tokens for questions and 256 tokens for passages.

  • Trained on an 80GB NVIDIA H100 GPU for 50k steps
  • AdamW optimizer with a peak learning rate of 3e-6
  • Batch size of 128 with 10% warmup steps
  • Uses hard negatives from 12 distinct dense retrievers
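As a rough sketch of the loss combination described above (the tensor shapes, names, and the equal weighting of the two terms are assumptions, not reported hyperparameters): pos_scores and neg_scores hold each query's MaxSim score against its positive and hard-negative passage, and in_batch_scores scores each query against every passage in the batch, with positives on the diagonal.

```python
import torch
import torch.nn.functional as F

def colbert_style_loss(pos_scores: torch.Tensor,
                       neg_scores: torch.Tensor,
                       in_batch_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise softmax cross-entropy over (positive, hard negative) pairs plus
    in-batch sampled softmax cross-entropy; shapes are illustrative assumptions.

    pos_scores:      (batch,)        score of each query's positive passage
    neg_scores:      (batch,)        score of each query's hard negative
    in_batch_scores: (batch, batch)  scores of each query against every
                                     in-batch passage (positives on the diagonal)
    """
    batch_size = pos_scores.size(0)

    # Pairwise softmax cross-entropy: the positive should beat its hard negative.
    pair_logits = torch.stack([pos_scores, neg_scores], dim=1)  # (batch, 2)
    pair_targets = torch.zeros(batch_size, dtype=torch.long)    # positive is index 0
    pairwise_loss = F.cross_entropy(pair_logits, pair_targets)

    # In-batch sampled softmax cross-entropy: the positive should beat
    # every other passage in the batch.
    in_batch_targets = torch.arange(batch_size)
    in_batch_loss = F.cross_entropy(in_batch_scores, in_batch_targets)

    # Equal weighting of the two terms is an assumption.
    return pairwise_loss + in_batch_loss
```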

Core Capabilities

  • Zero-shot multilingual retrieval across 81 languages
  • Competitive performance with state-of-the-art multilingual retrievers on the mMARCO and Mr. TyDi benchmarks
  • Efficient token-level representation using MaxSim operators
  • Strong performance in both high- and low-resource languages

Frequently Asked Questions

Q: What makes this model unique?

ColBERT-XM's distinctive feature is its ability to perform efficient multilingual retrieval using token-level representations while maintaining strong performance across diverse languages without requiring language-specific training data. The model achieves this through its innovative use of the XMOD backbone and language-specific adapters.
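To make the adapter mechanism concrete, here is a small sketch of how an XMOD encoder routes inputs through a language-specific adapter at inference time. The facebook/xmod-base checkpoint and the language codes are taken from the base XMOD release and serve only as stand-ins for how ColBERT-XM, trained on English data, can be queried in other languages.

```python
from transformers import AutoTokenizer, XmodModel

# Stand-in backbone: ColBERT-XM fine-tunes an XMOD encoder and adds its own
# projection head, so this snippet only illustrates the adapter routing.
tokenizer = AutoTokenizer.from_pretrained("facebook/xmod-base")
model = XmodModel.from_pretrained("facebook/xmod-base")

def token_embeddings(text: str, language: str):
    """Encode text through the adapter of the given language (e.g. 'fr_XX')."""
    model.set_default_language(language)        # select that language's adapter
    batch = tokenizer(text, return_tensors="pt")
    return model(**batch).last_hidden_state     # shared weights + per-language adapter

# Zero-shot idea: the retrieval model is trained only on English triples, but the
# same shared weights serve other languages through their adapters.
en = token_embeddings("Where is the Eiffel Tower?", "en_XX")
fr = token_embeddings("Où se trouve la tour Eiffel ?", "fr_XX")
```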

Q: What are the recommended use cases?

The model is ideal for multilingual information retrieval systems, cross-lingual search applications, and digital libraries requiring semantic search capabilities across multiple languages. It's particularly valuable for organizations needing to handle queries in multiple languages without maintaining separate models for each language.
