CodeT5+ 110M Embedding Model
| Property | Value |
|---|---|
| Developer | Salesforce |
| License | BSD-3-Clause |
| Paper | CodeT5+: Open Code Large Language Models |
| Embedding Size | 256 dimensions |
What is codet5p-110m-embedding?
CodeT5+ 110M embedding is a specialized code embedding model developed by Salesforce that generates fixed-length, 256-dimensional vector representations of source code. It is built on the CodeT5+ architecture, which combines an encoder-decoder framework with multiple pretraining objectives, including span denoising, causal language modeling, and contrastive learning.
Implementation Details
The model consists of the encoder from the CodeT5+ 220M model plus a projection layer, trained in a two-stage process on both unimodal (code-only) and bimodal (text-code) data. It supports nine programming languages: Python, Java, JavaScript, Go, Ruby, PHP, C, C++, and C#.
- Utilizes the transformers library with AutoModel functionality (see the loading sketch after this list)
- Compatible with the CodeT5 tokenizer
- Trained on permissively licensed code from GitHub
- Produces normalized 256-dimensional embeddings
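A minimal loading sketch is shown below. It follows the standard Hugging Face AutoModel pattern; the checkpoint name `Salesforce/codet5p-110m-embedding` and the `trust_remote_code=True` flag reflect the public Hugging Face release, and the code snippet being embedded is purely illustrative.

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint name as published on the Hugging Face Hub.
checkpoint = "Salesforce/codet5p-110m-embedding"
device = "cpu"  # switch to "cuda" if a GPU is available

# The embedding head is shipped as custom model code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Embed a single (illustrative) code snippet; the model returns one
# 256-dimensional vector per input sequence.
inputs = tokenizer.encode("def greet():\n    print('hello')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(embedding.shape)  # torch.Size([256])
```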
Core Capabilities
- Zero-shot code retrieval with 74.23% overall accuracy across languages
- Strong performance on language-specific retrieval (90.69% for Go, 74.51% for Ruby)
- Efficient code representation for similarity matching and search (see the retrieval sketch after this list)
- Cross-language code understanding
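Below is a minimal sketch of zero-shot text-to-code retrieval with these embeddings, assuming `tokenizer`, `model`, and `device` have been loaded as in the previous example; the query, the candidate snippets, and the `embed` helper are illustrative names, not part of the model's API.

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Returns a (1, 256) embedding for a single query or code snippet.
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    with torch.no_grad():
        return model(input_ids)

candidates = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    return open(path).read()",
]

query_emb = embed("sum two numbers")                    # shape (1, 256)
cand_embs = torch.cat([embed(c) for c in candidates])   # shape (2, 256)

# Embeddings are normalized, so cosine similarity ranks the same as a dot product.
scores = F.cosine_similarity(query_emb, cand_embs)
print(candidates[scores.argmax().item()])
```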
Frequently Asked Questions
Q: What makes this model unique?
The model combines multiple pretraining objectives and follows the CodeT5+ family's compute-efficient pretraining recipe, which can leverage frozen off-the-shelf LLMs. It is specifically designed for code embedding and delivers strong cross-language performance.
Q: What are the recommended use cases?
The model is ideal for code retrieval tasks, code similarity detection, and code search applications. It's particularly effective when you need to convert code snippets into meaningful vector representations for downstream tasks.
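As an illustration of the code-search use case, the sketch below precomputes embeddings for a small corpus once and reuses them for repeated queries. It assumes the `embed` helper from the retrieval example above; the corpus, file names, and `search` function are hypothetical.

```python
import torch

# Hypothetical corpus: map each file or snippet name to its source code.
corpus = {
    "math_utils.py": "def multiply(a, b):\n    return a * b",
    "io_utils.py": "def write_json(path, obj):\n    import json\n    json.dump(obj, open(path, 'w'))",
}

# Index once: stack all embeddings into a single (N, 256) matrix.
names = list(corpus)
index = torch.cat([embed(code) for code in corpus.values()])

def search(query: str, top_k: int = 2):
    # With normalized embeddings, a dot product ranks candidates by cosine similarity.
    scores = (embed(query) @ index.T).squeeze(0)        # shape (N,)
    best = scores.topk(min(top_k, len(names)))
    return [(names[i.item()], s.item()) for s, i in zip(best.values, best.indices)]

print(search("serialize an object to a json file"))
```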