colbertv2-camembert-L4-mmarcoFR
| Property | Value |
|---|---|
| Parameter Count | 53.9M |
| License | MIT |
| Language | French |
| Embedding Dimension | 32 |
| Index Size | 9GB for 8.8M passages |
What is colbertv2-camembert-L4-mmarcoFR?
This is a lightweight French retrieval model based on the ColBERTv2 architecture, designed specifically for semantic search. It offers a strong balance between model size and retrieval performance for French. The model encodes queries and passages into matrices of token-level embeddings and retrieves passages through efficient vector-similarity (MaxSim) operations.
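The late-interaction scoring that ColBERT-style models use can be illustrated with a short sketch: each query token is matched against its most similar passage token, and the per-token maxima are summed (MaxSim). The snippet below is a minimal illustration of that operation, not the model's actual inference code; the tensor shapes and names are illustrative only.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """
    ColBERT-style late-interaction (MaxSim) score.
    query_emb:   (num_query_tokens, dim)   L2-normalized token embeddings
    passage_emb: (num_passage_tokens, dim) L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every passage token.
    sim = query_emb @ passage_emb.T  # (num_query_tokens, num_passage_tokens)
    # For each query token, keep its best-matching passage token, then sum over query tokens.
    return sim.max(dim=1).values.sum()

# Toy usage with random 32-dimensional embeddings (the model's output dimension).
q = torch.nn.functional.normalize(torch.randn(32, 32), dim=-1)    # 32 query tokens
p = torch.nn.functional.normalize(torch.randn(160, 32), dim=-1)   # 160 passage tokens
print(maxsim_score(q, p))
```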
Implementation Details
The model is built on the camembert-L4 architecture and trained on the French portion of the mMARCO dataset, which comprises 8.8M passages and 539K training queries. Training used knowledge distillation from a cross-encoder, with 62 hard negatives sampled per query. The main hyperparameters are listed below, followed by a configuration sketch.
- Trained for 325k steps on one 80GB NVIDIA H100 GPU
- AdamW optimizer with a peak learning rate of 1e-5
- Maximum sequence lengths: 32 tokens for queries, 160 tokens for passages
- 32-dimensional embeddings scored with cosine similarity
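To show how these settings map onto code, here is a minimal, hedged sketch of a training configuration using the colbert-ai library (one of the frameworks mentioned under Core Capabilities). The file paths, experiment name, and base-checkpoint identifier are placeholders; this is not the author's actual training script.

```python
from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="mmarcoFR")):
        config = ColBERTConfig(
            dim=32,               # 32-dimensional token embeddings
            similarity="cosine",  # cosine similarity scoring
            query_maxlen=32,      # max query length in tokens
            doc_maxlen=160,       # max passage length in tokens
            lr=1e-5,              # AdamW peak learning rate
            maxsteps=325_000,     # total training steps
        )
        trainer = Trainer(
            triples="path/to/triples.jsonl",      # distillation examples with teacher scores (placeholder path)
            queries="path/to/queries.tsv",
            collection="path/to/collection.tsv",
            config=config,
        )
        trainer.train(checkpoint="camembert-L4")  # 4-layer CamemBERT base encoder (placeholder id)
```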
Core Capabilities
- Achieves 91.9% Recall@1000 on mMARCO-fr validation set
- Efficient passage retrieval with compressed index size (9GB for 8.8M passages)
- Optimized for French language semantic search
- Integrates with the RAGatouille and ColBERT-AI frameworks (see the usage sketch below)
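As a concrete illustration of the RAGatouille integration, the sketch below builds a small index and runs a query. It assumes the checkpoint is available on the Hugging Face Hub under the id shown (inferred from the model name); the documents, index name, and query are purely illustrative.

```python
from ragatouille import RAGPretrainedModel

# Load the ColBERT checkpoint through RAGatouille (hub id assumed from the model name).
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")

documents = [
    "Paris est la capitale de la France.",
    "Le Mont Blanc est le plus haut sommet des Alpes.",
]

# Build a compressed ColBERT index over the toy collection, then query it.
RAG.index(collection=documents, index_name="demo_fr", max_document_length=160)
results = RAG.search(query="Quelle est la capitale de la France ?", k=2)
print(results)  # ranked passages with scores
```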
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its efficiency-to-performance ratio: it achieves results comparable to or better than larger French models while using only 53.9M parameters and 32-dimensional embeddings, which keeps both inference cost and index size low. This makes it particularly suitable for production environments with resource constraints.
Q: What are the recommended use cases?
The model is ideal for French language semantic search applications, particularly in scenarios requiring efficient passage retrieval from large document collections. It's especially suitable for RAG applications, digital libraries, and information retrieval systems focused on French content.