VLM2Vec-Full

TIGER-Lab

VLM2Vec-Full is a 4.15B parameter vision-language embedding model based on Phi-3.5-vision-instruct, trained for unified multimodal embedding tasks.

  • Parameter Count: 4.15B
  • License: Apache-2.0
  • Paper: arXiv:2410.05160
  • Tensor Type: BF16
  • Base Model: microsoft/Phi-3.5-vision-instruct

What is VLM2Vec-Full?

VLM2Vec-Full is a vision-language embedding model developed by TIGER-Lab. Rather than training an embedding model from scratch, it converts a well-trained vision-language model into one: it is built on Microsoft's Phi-3.5-vision-instruct architecture and fine-tuned on the MMEB-train dataset to handle a broad range of multimodal embedding tasks within a single framework.

Implementation Details

The model is trained with a contrastive objective that uses in-batch negatives (a minimal sketch follows the list below), with a batch size of 2048 for full-model training. It is implemented in PyTorch with the Transformers library and stored in BF16 precision to reduce memory footprint.

  • Trained on TIGER-Lab/MMEB-train dataset
  • Evaluated across 36 different datasets
  • Supports both image-to-text and text-to-image embedding tasks
  • Uses custom pooling and normalization strategies
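
As a rough illustration of the training objective, here is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, assuming L2-normalized query and target embeddings; the temperature value is illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.02):
    """InfoNCE loss where every other target in the batch is a negative.

    query_emb, target_emb: (batch, dim) L2-normalized embeddings.
    The temperature here is illustrative, not the paper's value.
    """
    # Cosine similarity matrix: entry (i, j) scores query i against target j.
    logits = query_emb @ target_emb.T / temperature
    # The matched pair for query i sits on the diagonal (index i).
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random normalized embeddings (batch of 8, dim 3072).
q = F.normalize(torch.randn(8, 3072), dim=-1)
t = F.normalize(torch.randn(8, 3072), dim=-1)
loss = in_batch_contrastive_loss(q, t)
```

With a batch size of 2048, each query is scored against 1 positive and 2047 in-batch negatives, which is what makes large batches valuable for this objective.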

Core Capabilities

  • Multimodal embedding generation for images and text
  • Cross-modal similarity computation
  • Support for various embedding tasks through a unified architecture
  • Flexible input handling for both image and text queries
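
As a hedged sketch of how such embeddings might be obtained and compared with standard Transformers APIs: the pooling below (last hidden state of the final token, then L2 normalization) and the prompt strings are assumptions for illustration; the model's actual pooling, normalization, and prompt format are defined in TIGER-Lab's code, which should be consulted for real use.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# trust_remote_code is needed because Phi-3.5-vision-based checkpoints
# ship custom modeling code.
model_id = "TIGER-Lab/VLM2Vec-Full"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

def embed(text, image=None):
    """Encode text (optionally with an image) into a normalized vector.

    Last-token pooling + L2 normalization is assumed here for illustration;
    the official VLM2Vec repository defines the exact pooling strategy.
    """
    inputs = processor(text, [image] if image else None, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    vec = out.hidden_states[-1][:, -1, :]  # hidden state of the last token
    return F.normalize(vec.float(), dim=-1)

# Cross-modal similarity: a higher cosine score means a closer pair.
img_vec = embed("<|image_1|> Represent the given image.", Image.open("cat.jpg"))
txt_vec = embed("A photo of a cat.")
similarity = (img_vec @ txt_vec.T).item()
```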

Frequently Asked Questions

Q: What makes this model unique?

VLM2Vec-Full's main distinction is its approach: instead of designing a new embedding architecture, it converts a well-trained VLM into an embedding model, and the accompanying paper reports that it outperforms existing baselines on multimodal embedding tasks. Its unified architecture lets it handle diverse embedding tasks within a single framework.

Q: What are the recommended use cases?

The model is ideal for applications requiring cross-modal similarity matching, image-text retrieval, and general multimodal embedding generation. It's particularly useful for tasks that need to understand relationships between visual and textual content.
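
For retrieval, one common pattern is to pre-compute candidate embeddings and rank them by cosine similarity against a query embedding. A minimal sketch, reusing the hypothetical `embed` helper from the earlier example:

```python
import torch
from PIL import Image

# Illustrative image-to-text retrieval: rank candidate captions for an image,
# reusing the embed() helper sketched above.
captions = [
    "A photo of a cat.",
    "A diagram of a neural network.",
    "A city skyline at night.",
]
cand_vecs = torch.cat([embed(c) for c in captions])        # (3, dim)
query_vec = embed("<|image_1|> Represent the given image.",
                  Image.open("cat.jpg"))                   # (1, dim)

# Because the vectors are L2-normalized, the dot product is cosine similarity.
scores = (query_vec @ cand_vecs.T).squeeze(0)
best_caption = captions[scores.argmax().item()]
```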
