VLM2Vec-Full

Maintained by: TIGER-Lab


  • Parameter Count: 4.15B
  • License: Apache-2.0
  • Paper: arXiv:2410.05160
  • Tensor Type: BF16
  • Base Model: microsoft/Phi-3.5-vision-instruct

What is VLM2Vec-Full?

VLM2Vec-Full is a sophisticated vision-language model designed to handle unified multimodal embedding tasks. Developed by TIGER-Lab, it represents a significant advancement in converting pre-trained vision-language models into powerful embedding models. The model is built upon Microsoft's Phi-3.5-vision-instruct architecture and has been trained on the comprehensive MMEB-train dataset.

Implementation Details

The model is trained with contrastive learning using in-batch negatives and a batch size of 2048 for full-model training. It is implemented in PyTorch with the Transformers library and stored in BF16 precision, which roughly halves memory use relative to FP32.

  • Trained on TIGER-Lab/MMEB-train dataset
  • Evaluated across 36 different datasets
  • Supports both image-to-text and text-to-image embedding tasks
  • Uses custom pooling and normalization strategies
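The contrastive objective with in-batch negatives can be sketched as follows. This is a minimal NumPy illustration of the general InfoNCE-style loss, not the project's actual training code; the function name and the temperature value are assumptions:

```python
import numpy as np

def in_batch_contrastive_loss(query_embs, target_embs, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    Each query's positive is the target at the same batch index;
    every other target in the batch serves as a negative.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal entries are positives
```

A large batch size (2048 here) matters for this objective because every other example in the batch acts as a negative, so bigger batches give a harder, more informative contrast.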

Core Capabilities

  • Multimodal embedding generation for images and text
  • Cross-modal similarity computation
  • Support for various embedding tasks through a unified architecture
  • Flexible input handling for both image and text queries
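Cross-modal similarity between two embeddings (for example, one from an image and one from a text query) is typically computed as cosine similarity. A minimal sketch, independent of the model's own API:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; result lies in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the model normalizes its embeddings, a plain dot product between two outputs gives the same ranking as cosine similarity.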

Frequently Asked Questions

Q: What makes this model unique?

VLM2Vec-Full stands out for its ability to significantly outperform existing baselines in multimodal embedding tasks, achieved through its innovative approach of converting a well-trained VLM into an embedding model. Its unified architecture allows it to handle various embedding tasks within a single framework.

Q: What are the recommended use cases?

The model is ideal for applications requiring cross-modal similarity matching, image-text retrieval, and general multimodal embedding generation. It's particularly useful for tasks that need to understand relationships between visual and textual content.
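An image-text retrieval pipeline built on such embeddings reduces to ranking candidates by similarity to a query embedding. A hedged sketch (the function name and shapes are assumptions, not part of the VLM2Vec API):

```python
import numpy as np

def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return indices of the k candidates most similar to the query,
    ranked by cosine similarity (all embeddings are L2-normalized first)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                       # one similarity score per candidate
    return np.argsort(-scores)[:k].tolist()
```

In practice `query_emb` would come from embedding a text query and `candidate_embs` from embedding an image collection (or vice versa), since both modalities share the same embedding space.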
