VLM2Vec-Full
| Property | Value |
|---|---|
| Parameter Count | 4.15B |
| License | Apache-2.0 |
| Paper | arXiv:2410.05160 |
| Tensor Type | BF16 |
| Base Model | microsoft/Phi-3.5-vision-instruct |
What is VLM2Vec-Full?
VLM2Vec-Full is a vision-language model adapted to serve as a unified multimodal embedding model. Developed by TIGER-Lab, it demonstrates how a pre-trained vision-language model can be converted into a strong embedding model through contrastive training. The model is built on Microsoft's Phi-3.5-vision-instruct and trained on the MMEB-train dataset.
Implementation Details
The model is trained with a contrastive objective that uses in-batch negatives, with a batch size of 2048 for full-model training (a minimal sketch of this loss follows the list below). It is implemented with PyTorch and the Transformers library and uses BF16 precision to reduce memory usage while keeping throughput high on hardware with BF16 support.
- Trained on the TIGER-Lab/MMEB-train dataset
- Evaluated on the 36 datasets of the MMEB benchmark
- Supports both image-to-text and text-to-image embedding tasks
- Uses custom pooling and normalization strategies (see the pooling sketch after this list)
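The exact pooling and normalization code lives in the TIGER-Lab VLM2Vec repository; as a rough illustration only, the sketch below assumes last-token pooling over the backbone's final hidden state followed by L2 normalization, a common choice for decoder-based embedding models. The tensor shapes and the `pool_and_normalize` helper are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def pool_and_normalize(last_hidden_state: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Pool a sequence of hidden states into one embedding.

    Assumes last-token pooling: take the hidden state of the final
    non-padding token, then L2-normalize it. This is an illustrative
    choice, not necessarily the exact strategy used by VLM2Vec-Full.
    """
    # Index of the last real (non-padding) token in each sequence
    last_idx = attention_mask.sum(dim=1) - 1            # (batch,)
    batch_idx = torch.arange(last_hidden_state.size(0))
    pooled = last_hidden_state[batch_idx, last_idx]     # (batch, hidden)
    return F.normalize(pooled, dim=-1)

# Toy example: batch of 2 sequences, length 5, hidden size 3072
hidden = torch.randn(2, 5, 3072)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
embeddings = pool_and_normalize(hidden, mask)           # shape (2, 3072)
```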
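Embeddings pooled this way would then feed a contrastive loss with in-batch negatives. The sketch below is a generic InfoNCE-style implementation under that assumption; the temperature value and variable names are placeholders rather than the released training configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_in_batch(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE-style loss: for each query, the matching target is the
    positive and every other target in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the positives
    logits = q @ t.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Random embeddings standing in for pooled model outputs
loss = contrastive_loss_in_batch(torch.randn(8, 3072), torch.randn(8, 3072))
```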
Core Capabilities
- Multimodal embedding generation for images and text (see the sketch after this list)
- Cross-modal similarity computation
- Support for various embedding tasks through a unified architecture
- Flexible input handling for both image and text queries
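A minimal sketch of embedding generation follows, assuming the checkpoint can be loaded directly through the Transformers Auto classes with `trust_remote_code` and a Phi-3.5-vision-style image placeholder prompt; the officially supported inference path is the TIGER-Lab VLM2Vec codebase, and the prompt text, image file name, and last-token pooling here are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "TIGER-Lab/VLM2Vec-Full"

# Hypothetical direct-Transformers loading path (the official route wraps
# the backbone with the VLM2Vec repository's own model and pooling code).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the checkpoint is published in BF16
    trust_remote_code=True,
)

# Phi-3.5-vision-style prompt with an image placeholder (assumed format)
prompt = "<|image_1|> Represent the given image for retrieval."
image = Image.open("example.jpg")  # illustrative file name
inputs = processor(prompt, [image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Pool the final hidden state into a single embedding (assumed strategy)
embedding = F.normalize(outputs.hidden_states[-1][:, -1], dim=-1)
print(embedding.shape)
```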
Frequently Asked Questions
Q: What makes this model unique?
Rather than training an embedding model from scratch, VLM2Vec-Full converts an already well-trained VLM into an embedding model, and the accompanying paper reports substantial gains over prior multimodal embedding baselines on the MMEB benchmark. Its unified architecture handles a wide range of embedding tasks within a single framework.
Q: What are the recommended use cases?
The model is ideal for applications requiring cross-modal similarity matching, image-text retrieval, and general multimodal embedding generation. It's particularly useful for tasks that need to understand relationships between visual and textual content.
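As an illustration of image-to-text retrieval with embeddings like these, the sketch below ranks a gallery of candidate text embeddings against a single image query embedding; the random tensors stand in for embeddings produced as in the earlier sketches, and the gallery size and top-k value are arbitrary.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for real model outputs
query_emb = F.normalize(torch.randn(1, 3072), dim=-1)      # one image query
text_embs = F.normalize(torch.randn(1000, 3072), dim=-1)   # candidate captions

# With L2-normalized vectors, cosine similarity is a plain matrix product
scores = query_emb @ text_embs.T                            # shape (1, 1000)
top_scores, top_idx = scores.topk(k=5, dim=-1)

print("best matches:", top_idx.tolist())
print("similarities:", top_scores.tolist())
```

Text-to-image retrieval works the same way, with the roles of the query and gallery embeddings swapped.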