CodeT5+ 110M Embedding Model
| Property | Value |
|---|---|
| Developer | Salesforce |
| License | BSD-3-Clause |
| Paper | CodeT5+: Open Code Large Language Models |
| Embedding Size | 256 dimensions |
What is codet5p-110m-embedding?
CodeT5+ 110M embedding is a specialized code embedding model developed by Salesforce that generates fixed-length, 256-dimensional vector representations of source code. It is built on the CodeT5+ architecture, which combines an encoder-decoder framework with multiple pretraining objectives, including span denoising, causal language modeling, and contrastive learning.
Implementation Details
The model consists of the encoder from the CodeT5+ 220M model plus a projection layer, trained in a two-stage process on both unimodal (code-only) and bimodal (text-code) data. It supports nine programming languages: Python, Java, JavaScript, Go, Ruby, PHP, C, C++, and C#.
- Utilizes the transformers library with AutoModel functionality (see the loading sketch after this list)
- Compatible with the CodeT5 tokenizer
- Trained on permissively licensed code from GitHub
- Produces normalized 256-dimensional embeddings
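A minimal loading sketch is shown below. It follows the standard Hugging Face AutoModel pattern; the checkpoint name `Salesforce/codet5p-110m-embedding` and the `trust_remote_code=True` flag reflect the public Hugging Face release, and the code snippet being embedded is purely illustrative.

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint name as published on the Hugging Face Hub.
checkpoint = "Salesforce/codet5p-110m-embedding"
device = "cpu"  # switch to "cuda" if a GPU is available

# The embedding head is shipped as custom model code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Embed a single (illustrative) code snippet; the model returns one
# 256-dimensional vector per input sequence.
inputs = tokenizer.encode("def greet():\n    print('hello')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(embedding.shape)  # torch.Size([256])
```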
Core Capabilities
- Zero-shot code retrieval with 74.23% overall accuracy across languages
- Strong performance on language-specific retrieval (90.69% for Go, 74.51% for Ruby)
- Efficient code representation for similarity matching and search (see the retrieval sketch after this list)
- Cross-language code understanding
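Below is a minimal sketch of zero-shot text-to-code retrieval with these embeddings, assuming `tokenizer`, `model`, and `device` have been loaded as in the previous example; the query, the candidate snippets, and the `embed` helper are illustrative names, not part of the model's API.

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Returns a (1, 256) embedding for a single query or code snippet.
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    with torch.no_grad():
        return model(input_ids)

candidates = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    return open(path).read()",
]

query_emb = embed("sum two numbers")                    # shape (1, 256)
cand_embs = torch.cat([embed(c) for c in candidates])   # shape (2, 256)

# Embeddings are normalized, so cosine similarity ranks the same as a dot product.
scores = F.cosine_similarity(query_emb, cand_embs)
print(candidates[scores.argmax().item()])
```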
Frequently Asked Questions
Q: What makes this model unique?
The model combines multiple pretraining objectives and follows the CodeT5+ family's compute-efficient pretraining recipe, which can leverage frozen off-the-shelf LLMs. It is specifically designed for code embedding and delivers strong cross-language performance.
Q: What are the recommended use cases?
The model is ideal for code retrieval tasks, code similarity detection, and code search applications. It's particularly effective when you need to convert code snippets into meaningful vector representations for downstream tasks.
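As an illustration of the code-search use case, the sketch below precomputes embeddings for a small corpus once and reuses them for repeated queries. It assumes the `embed` helper from the retrieval example above; the corpus, file names, and `search` function are hypothetical.

```python
import torch

# Hypothetical corpus: map each file or snippet name to its source code.
corpus = {
    "math_utils.py": "def multiply(a, b):\n    return a * b",
    "io_utils.py": "def write_json(path, obj):\n    import json\n    json.dump(obj, open(path, 'w'))",
}

# Index once: stack all embeddings into a single (N, 256) matrix.
names = list(corpus)
index = torch.cat([embed(code) for code in corpus.values()])

def search(query: str, top_k: int = 2):
    # With normalized embeddings, a dot product ranks candidates by cosine similarity.
    scores = (embed(query) @ index.T).squeeze(0)        # shape (N,)
    best = scores.topk(min(top_k, len(names)))
    return [(names[i.item()], s.item()) for s, i in zip(best.values, best.indices)]

print(search("serialize an object to a json file"))
```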