colbertv2-camembert-L4-mmarcoFR

Maintained By
antoinelouis

Property               Value
Parameter Count        53.9M
License                MIT
Language               French
Embedding Dimension    32
Index Size             9GB for 8.8M passages

What is colbertv2-camembert-L4-mmarcoFR?

This is a lightweight French retrieval model based on the ColBERTv2 architecture and designed for semantic search. It encodes queries and passages into matrices of token-level embeddings, so relevance can be computed with inexpensive vector-similarity operations between the two matrices. With 53.9M parameters and 32-dimensional embeddings, it aims to keep model and index size small while retaining strong retrieval quality on French text.
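
The token-level scoring described above can be illustrated with a small sketch. It assumes the standard ColBERT late-interaction rule (often called MaxSim): each query token is matched to its most similar passage token, and those maxima are summed. The function names and the toy 2-dimensional vectors are hypothetical; the model itself produces 32-dimensional embeddings, and real libraries run this on batched tensors.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_emb, passage_emb):
    """Sum, over query tokens, of the best cosine similarity to any passage token."""
    q = [l2_normalize(v) for v in query_emb]
    p = [l2_normalize(v) for v in passage_emb]
    score = 0.0
    for q_vec in q:
        # Best-matching passage token for this query token.
        score += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
    return score

# Toy embeddings: one row per token (hypothetical values, 2 dims for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a > score_b)  # passage_a matches both query tokens exactly
```

Because every query token contributes its own maximum, a passage that covers all query terms scores higher than one that matches only part of the query.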

Implementation Details

The model is built on the camembert-L4 architecture and trained on the French portion of the mMARCO dataset, which comprises 8.8M passages and 539K training queries. Each training query is paired with 62 hard negatives, and the model is trained by distillation from a cross-encoder.

  • Trained on one 80GB NVIDIA H100 GPU for 325k steps
  • Uses the AdamW optimizer with a peak learning rate of 1e-5
  • Maximum sequence lengths: 32 tokens for questions, 160 tokens for passages
  • 32-dimensional embeddings scored with cosine similarity
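
To make the cross-encoder distillation step more concrete, here is a minimal sketch of one common formulation: a KL-divergence loss between the teacher's and the student's score distributions over a query's candidate passages. The source does not state which loss was used, so the KL objective, the function names, and the toy scores below are all assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages."""
    t = softmax(teacher_scores)
    s = softmax(student_scores)
    return sum(ti * (math.log(ti) - math.log(si)) for ti, si in zip(t, s) if ti > 0)

# Hypothetical scores for one query: index 0 is the positive passage,
# the rest stand in for hard negatives (the model card reports 62 per query).
teacher = [9.0, 2.5, 1.0, -3.0]   # cross-encoder (teacher) relevance scores
student = [5.0, 4.0, 1.0, -1.0]   # ColBERT (student) late-interaction scores

loss = kl_distillation_loss(teacher, student)
loss_when_matched = kl_distillation_loss(teacher, teacher)
print(loss > loss_when_matched)
```

The loss reaches zero only when the student ranks and weights the candidates the same way the teacher does, which is how the hard negatives supply a graded training signal rather than a binary one.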

Core Capabilities

  • Achieves 91.9% Recall@1000 on the mMARCO-fr validation set
  • Efficient passage retrieval with a compressed index (9GB for 8.8M passages)
  • Optimized for French-language semantic search
  • Works with the RAGatouille and colbert-ai libraries
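
For readers unfamiliar with the Recall@1000 figure above, the following sketch shows how Recall@k is typically computed: for each query, the fraction of its relevant passages that appear in the top k retrieved results, averaged over queries. The function names, passage IDs, and toy rankings are hypothetical and are not taken from the mMARCO evaluation scripts.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of a query's relevant passages found in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, qrels, k):
    """Average Recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)

# Hypothetical retrieval output (ranked passage IDs) and relevance judgments.
runs = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p2", "p4"]}
qrels = {"q1": {"p1"}, "q2": {"p4", "p5"}}

recall_at_2 = mean_recall_at_k(runs, qrels, k=2)
recall_at_3 = mean_recall_at_k(runs, qrels, k=3)
print(recall_at_2, recall_at_3)
```

With k set to 1000, the metric measures how often the relevant passage survives first-stage retrieval at all, which matters most when a reranker or a generator consumes the candidates afterwards.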

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its efficiency-to-performance ratio: with only 53.9M parameters and 32-dimensional embeddings, it achieves results comparable to or better than those of larger models. This makes it well suited to production environments where compute, memory, or storage is constrained.

Q: What are the recommended use cases?

The model is suited to French-language semantic search, particularly efficient passage retrieval from large document collections. It fits RAG applications, digital libraries, and other information retrieval systems focused on French content.
