VLM2Vec-Full

TIGER-Lab

VLM2Vec-Full is a 4.15B parameter vision-language embedding model based on Phi-3.5-vision-instruct, trained for unified multimodal embedding tasks.

  • Parameter Count: 4.15B
  • License: Apache-2.0
  • Paper: arXiv:2410.05160
  • Tensor Type: BF16
  • Base Model: microsoft/Phi-3.5-vision-instruct

What is VLM2Vec-Full?

VLM2Vec-Full is a vision-language embedding model developed by TIGER-Lab. Rather than training an embedding model from scratch, it converts a well-trained vision-language model into one: it is built on Microsoft's Phi-3.5-vision-instruct architecture and fine-tuned on the MMEB-train dataset to handle a broad range of multimodal embedding tasks within a single framework.

Implementation Details

The model is trained with a contrastive objective that uses in-batch negatives (a minimal sketch follows the list below), with a batch size of 2048 for full-model training. It is implemented in PyTorch with the Transformers library and stored in BF16 precision to reduce memory footprint.

  • Trained on TIGER-Lab/MMEB-train dataset
  • Evaluated across 36 different datasets
  • Supports both image-to-text and text-to-image embedding tasks
  • Uses custom pooling and normalization strategies
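
As a rough illustration of the training objective, here is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, assuming L2-normalized query and target embeddings; the temperature value is illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.02):
    """InfoNCE loss where every other target in the batch is a negative.

    query_emb, target_emb: (batch, dim) L2-normalized embeddings.
    The temperature here is illustrative, not the paper's value.
    """
    # Cosine similarity matrix: entry (i, j) scores query i against target j.
    logits = query_emb @ target_emb.T / temperature
    # The matched pair for query i sits on the diagonal (index i).
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random normalized embeddings (batch of 8, dim 3072).
q = F.normalize(torch.randn(8, 3072), dim=-1)
t = F.normalize(torch.randn(8, 3072), dim=-1)
loss = in_batch_contrastive_loss(q, t)
```

With a batch size of 2048, each query is scored against 1 positive and 2047 in-batch negatives, which is what makes large batches valuable for this objective.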

Core Capabilities

  • Multimodal embedding generation for images and text
  • Cross-modal similarity computation
  • Support for various embedding tasks through a unified architecture
  • Flexible input handling for both image and text queries
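
As a hedged sketch of how such embeddings might be obtained and compared with standard Transformers APIs: the pooling below (last hidden state of the final token, then L2 normalization) and the prompt strings are assumptions for illustration; the model's actual pooling, normalization, and prompt format are defined in TIGER-Lab's code, which should be consulted for real use.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# trust_remote_code is needed because Phi-3.5-vision-based checkpoints
# ship custom modeling code.
model_id = "TIGER-Lab/VLM2Vec-Full"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

def embed(text, image=None):
    """Encode text (optionally with an image) into a normalized vector.

    Last-token pooling + L2 normalization is assumed here for illustration;
    the official VLM2Vec repository defines the exact pooling strategy.
    """
    inputs = processor(text, [image] if image else None, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    vec = out.hidden_states[-1][:, -1, :]  # hidden state of the last token
    return F.normalize(vec.float(), dim=-1)

# Cross-modal similarity: a higher cosine score means a closer pair.
img_vec = embed("<|image_1|> Represent the given image.", Image.open("cat.jpg"))
txt_vec = embed("A photo of a cat.")
similarity = (img_vec @ txt_vec.T).item()
```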

Frequently Asked Questions

Q: What makes this model unique?

VLM2Vec-Full's main distinction is its approach: instead of designing a new embedding architecture, it converts a well-trained VLM into an embedding model, and the accompanying paper reports that it outperforms existing baselines on multimodal embedding tasks. Its unified architecture lets it handle diverse embedding tasks within a single framework.

Q: What are the recommended use cases?

The model is ideal for applications requiring cross-modal similarity matching, image-text retrieval, and general multimodal embedding generation. It's particularly useful for tasks that need to understand relationships between visual and textual content.
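
For retrieval, one common pattern is to pre-compute candidate embeddings and rank them by cosine similarity against a query embedding. A minimal sketch, reusing the hypothetical `embed` helper from the earlier example:

```python
import torch
from PIL import Image

# Illustrative image-to-text retrieval: rank candidate captions for an image,
# reusing the embed() helper sketched above.
captions = [
    "A photo of a cat.",
    "A diagram of a neural network.",
    "A city skyline at night.",
]
cand_vecs = torch.cat([embed(c) for c in captions])        # (3, dim)
query_vec = embed("<|image_1|> Represent the given image.",
                  Image.open("cat.jpg"))                   # (1, dim)

# Because the vectors are L2-normalized, the dot product is cosine similarity.
scores = (query_vec @ cand_vecs.T).squeeze(0)
best_caption = captions[scores.argmax().item()]
```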
