SFR-Embedding-Code-2B_R
| Property | Value |
|---|---|
| Model Size | 2B parameters |
| Developer | Salesforce Research |
| Paper | CodeXEmbed: A Generalist Embedding Model Family |
| License | Research-only (Subject to Gemma Terms of Use) |
| Performance | 67.4% NDCG@10 on CoIR Benchmark |
What is SFR-Embedding-Code-2B_R?
SFR-Embedding-Code-2B_R is an embedding model for multilingual, multi-task code and text retrieval. Developed by Salesforce Research, it is the 2B-parameter model in the SFR-Embedding-Code family (the CodeXEmbed paper describes sizes from 400M to 7B parameters) and is reported to outperform existing open-source code embedding models.
Implementation Details
The model can be loaded with either the Transformers or the Sentence Transformers library. It supports a maximum sequence length of 32,768 tokens, and queries must follow a specific instruction format. The architecture is based on Gemma and has been fine-tuned for code retrieval tasks.
- Built on Gemma architecture with 2B parameters
- Supports both code and text embedding generation
- Implements instruction-based query formatting
- Integrates with the Transformers and Sentence Transformers libraries
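As a rough illustration of the instruction-based query formatting listed above, the sketch below prefixes a query with a task instruction and ranks passages by cosine similarity. The `Instruct: ...\nQuery: ...` template, the placeholder instruction text, the repository id `Salesforce/SFR-Embedding-Code-2B_R`, and the helper names (`format_query`, `rank_passages`, `embed`) are assumptions made for this example, not confirmed details of the model; the model card should be checked for the exact format. Passages are assumed to be embedded without a prefix.

```python
import math
from typing import List, Sequence, Tuple

# Assumption: placeholder instruction text, not the official wording.
DEFAULT_INSTRUCTION = "Given a question, retrieve relevant code"


def format_query(query: str, instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Prefix a query with its task instruction (assumed template)."""
    return f"Instruct: {instruction}\nQuery: {query}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed(texts: List[str], model_id: str = "Salesforce/SFR-Embedding-Code-2B_R"):
    """Hypothetical helper: embed texts with Sentence Transformers.

    The import is deferred so the rest of this sketch runs without the
    library installed. The model id is an assumed repository name.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id, trust_remote_code=True)
    return model.encode(texts, normalize_embeddings=True)
```

With real embeddings, `rank_passages(embed([format_query(q)])[0], embed(passages))` would return passage indices ordered by similarity. The ranking helper can be checked without the model by passing small hand-written vectors.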
Core Capabilities
- Multilingual code retrieval and understanding
- High-performance text-to-code matching
- 67.4% NDCG@10 on the CoIR benchmark
- Long-context support up to 32,768 tokens
- Flexible API support for both Transformers and Sentence Transformers
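For readers unfamiliar with the metric behind the 67.4% figure, here is a minimal sketch of how NDCG@10 is commonly computed for a single query. It assumes linear gain and a log2 rank discount; the CoIR evaluation code may differ in detail (for example in its gain function or tie handling), and the function names are illustrative. Benchmark scores are typically averages of this per-query value.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the returned results, in rank order.
    all_relevances: relevance labels of every judged document for the query,
        used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

Under these assumptions, a ranking that places the only relevant item second scores 1 / log2(3), about 0.63, while placing it first scores 1.0.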
Frequently Asked Questions
Q: What makes this model unique?
The model's main distinction is its code retrieval performance: it scores 67.4% NDCG@10 on the CoIR benchmark, reported as the highest among comparable models. It also handles both code and text embedding within a single 2B-parameter model.
Q: What are the recommended use cases?
The model is released for research use and targets code retrieval, documentation matching, and code-text alignment. It is particularly suited to code search, linking code to its documentation, and detecting similar code.