colbertv2-camembert-L4-mmarcoFR
| Property | Value |
|---|---|
| Parameter Count | 53.9M |
| License | MIT |
| Language | French |
| Embedding Dimension | 32 |
| Index Size | 9GB for 8.8M passages |
What is colbertv2-camembert-L4-mmarcoFR?
This is a lightweight French retrieval model based on the ColBERTv2 architecture, designed specifically for semantic search. It offers a strong balance between model size and retrieval performance for French. The model encodes queries and passages into matrices of token-level embeddings and retrieves passages through efficient vector-similarity (MaxSim) operations.
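The late-interaction scoring that ColBERT-style models use can be illustrated with a short sketch: each query token is matched against its most similar passage token, and the per-token maxima are summed (MaxSim). The snippet below is a minimal illustration of that operation, not the model's actual inference code; the tensor shapes and names are illustrative only.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """
    ColBERT-style late-interaction (MaxSim) score.
    query_emb:   (num_query_tokens, dim)   L2-normalized token embeddings
    passage_emb: (num_passage_tokens, dim) L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every passage token.
    sim = query_emb @ passage_emb.T  # (num_query_tokens, num_passage_tokens)
    # For each query token, keep its best-matching passage token, then sum over query tokens.
    return sim.max(dim=1).values.sum()

# Toy usage with random 32-dimensional embeddings (the model's output dimension).
q = torch.nn.functional.normalize(torch.randn(32, 32), dim=-1)    # 32 query tokens
p = torch.nn.functional.normalize(torch.randn(160, 32), dim=-1)   # 160 passage tokens
print(maxsim_score(q, p))
```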
Implementation Details
The model is built on the camembert-L4 architecture and trained on the French portion of the mMARCO dataset, which comprises 8.8M passages and 539K training queries. Training used knowledge distillation from a cross-encoder, with 62 hard negatives sampled per query. The main hyperparameters are listed below, followed by a configuration sketch.
- Trained for 325k steps on one 80GB NVIDIA H100 GPU
- AdamW optimizer with a peak learning rate of 1e-5
- Maximum sequence lengths: 32 tokens for queries, 160 tokens for passages
- 32-dimensional embeddings scored with cosine similarity
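To show how these settings map onto code, here is a minimal, hedged sketch of a training configuration using the colbert-ai library (one of the frameworks mentioned under Core Capabilities). The file paths, experiment name, and base-checkpoint identifier are placeholders; this is not the author's actual training script.

```python
from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="mmarcoFR")):
        config = ColBERTConfig(
            dim=32,               # 32-dimensional token embeddings
            similarity="cosine",  # cosine similarity scoring
            query_maxlen=32,      # max query length in tokens
            doc_maxlen=160,       # max passage length in tokens
            lr=1e-5,              # AdamW peak learning rate
            maxsteps=325_000,     # total training steps
        )
        trainer = Trainer(
            triples="path/to/triples.jsonl",      # distillation examples with teacher scores (placeholder path)
            queries="path/to/queries.tsv",
            collection="path/to/collection.tsv",
            config=config,
        )
        trainer.train(checkpoint="camembert-L4")  # 4-layer CamemBERT base encoder (placeholder id)
```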
Core Capabilities
- Achieves 91.9% Recall@1000 on mMARCO-fr validation set
- Efficient passage retrieval with compressed index size (9GB for 8.8M passages)
- Optimized for French language semantic search
- Integrates with the RAGatouille and ColBERT-AI frameworks (see the usage sketch below)
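As a concrete illustration of the RAGatouille integration, the sketch below builds a small index and runs a query. It assumes the checkpoint is available on the Hugging Face Hub under the id shown (inferred from the model name); the documents, index name, and query are purely illustrative.

```python
from ragatouille import RAGPretrainedModel

# Load the ColBERT checkpoint through RAGatouille (hub id assumed from the model name).
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")

documents = [
    "Paris est la capitale de la France.",
    "Le Mont Blanc est le plus haut sommet des Alpes.",
]

# Build a compressed ColBERT index over the toy collection, then query it.
RAG.index(collection=documents, index_name="demo_fr", max_document_length=160)
results = RAG.search(query="Quelle est la capitale de la France ?", k=2)
print(results)  # ranked passages with scores
```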
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its efficiency-to-performance ratio: it achieves results comparable to or better than larger French models while using only 53.9M parameters and 32-dimensional embeddings, which keeps both inference cost and index size low. This makes it particularly suitable for production environments with resource constraints.
Q: What are the recommended use cases?
The model is ideal for French language semantic search applications, particularly in scenarios requiring efficient passage retrieval from large document collections. It's especially suitable for RAG applications, digital libraries, and information retrieval systems focused on French content.