LLaVE-2B
| Property | Value |
|---|---|
| Parameter Count | 2 billion |
| Context Window | 4K tokens |
| Base Model | Aquila-VL-2B |
| Training Data | MMEB-train (662K pairs) |
| Paper | arXiv:2503.04812 |
What is LLaVE-2B?
LLaVE-2B is a state-of-the-art multimodal embedding model that maps text, images, and videos into a shared embedding space and captures the relationships between them. Built on Aquila-VL-2B, it achieves strong results on multimodal embedding benchmarks such as MMEB while using relatively modest computational resources and training data.
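Because everything is mapped into one embedding space, cross-modal relevance reduces to a vector similarity. The sketch below shows that scoring step in plain PyTorch; the vectors are random stand-ins for real model outputs, since the embedding call itself depends on the released checkpoint's own code.

```python
import torch
import torch.nn.functional as F

def rank_images(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate images against one text query by cosine similarity.

    text_emb:   (D,)   embedding of the query text
    image_embs: (N, D) embeddings of N candidate images
    Returns candidate indices sorted from most to least similar.
    """
    # Normalize so the dot product equals cosine similarity.
    q = F.normalize(text_emb, dim=-1)
    c = F.normalize(image_embs, dim=-1)
    scores = c @ q                      # (N,) cosine similarities
    return torch.argsort(scores, descending=True)

# Illustration with random vectors standing in for real model outputs.
query = torch.randn(1024)               # hypothetical text embedding
candidates = torch.randn(8, 1024)       # hypothetical image embeddings
print(rank_images(query, candidates))
```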
Implementation Details
The model is implemented in PyTorch and trained on 8 NVIDIA A100 GPUs. It uses a hardness-weighted contrastive learning objective, which places greater weight on hard negative pairs during training, and reaches top rankings on the MMEB leaderboard with only 662K training pairs.
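The exact weighting scheme is defined in the paper (arXiv:2503.04812); the sketch below is only one plausible reading, in which each in-batch negative's contribution is scaled by a softmax over negative similarities controlled by an assumed hardness coefficient `beta`.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_nce(q, t, tau=0.05, beta=2.0):
    """Illustrative hardness-weighted InfoNCE over an in-batch similarity matrix.

    q, t: (B, D) query/target embeddings (e.g. image and text sides of a pair).
    tau:  softmax temperature.
    beta: hardness coefficient; larger beta concentrates weight on the
          hardest (most similar) negatives.
    """
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    sim = q @ t.T                                   # (B, B) cosine similarities
    B = sim.size(0)
    pos_mask = torch.eye(B, dtype=torch.bool, device=sim.device)

    # Hardness weights over negatives: a softmax of beta-scaled similarities,
    # rescaled so the weights average to 1 across the B-1 negatives.
    neg_sim = sim.masked_fill(pos_mask, float("-inf"))
    w = torch.softmax(beta * neg_sim, dim=1) * (B - 1)
    w = w.masked_fill(pos_mask, 1.0)                # positives keep weight 1

    logits = sim / tau
    weighted_exp = w * torch.exp(logits)
    loss = -(logits.diagonal() - torch.log(weighted_exp.sum(dim=1)))
    return loss.mean()
```

With `beta = 0` this reduces to standard InfoNCE; the weighting and temperature actually used by LLaVE may differ.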
- Efficient architecture with 2B parameters
- 4K token context window for long multimodal inputs
- Zero-shot capabilities for text-video retrieval
- Trained using the Hugging Face Trainer framework (see the sketch after this list)
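A custom objective like the one above can be plugged into the Hugging Face Trainer by overriding compute_loss. The skeleton below is a minimal sketch, not the project's actual training code: the output keys, output directory, and commented wiring are placeholders.

```python
from transformers import Trainer, TrainingArguments

class ContrastiveTrainer(Trainer):
    """Trainer subclass that swaps the default loss for a contrastive one."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Assumed convention: the model returns embeddings for the query and
        # target sides of each training pair (e.g. image and text).
        outputs = model(**inputs)
        # hardness_weighted_nce is the illustrative loss sketched above.
        loss = hardness_weighted_nce(outputs["query_embeds"],
                                     outputs["target_embeds"])
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="llave-2b-contrastive",   # hypothetical output path
    per_device_train_batch_size=32,
    num_train_epochs=1,
)
# trainer = ContrastiveTrainer(model=model, args=args,
#                              train_dataset=train_dataset,
#                              data_collator=collator)
# trainer.train()
```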
Core Capabilities
- Text-to-image embedding and retrieval
- Image-to-text embedding and matching
- Multi-image processing
- Video understanding and embedding
- Zero-shot generalization to new tasks
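The zero-shot text-to-video capability is reported in the paper. One simple way to score a video against a text query with an image-text embedding model is to embed sampled frames and mean-pool them, as sketched below; this pooling strategy is an illustrative assumption, not necessarily how LLaVE consumes video internally.

```python
import torch
import torch.nn.functional as F

def video_text_score(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Score a video against a text query from per-frame embeddings.

    frame_embs: (T, D) embeddings of T sampled frames
    text_emb:   (D,)   embedding of the text query
    """
    video_emb = F.normalize(frame_embs, dim=-1).mean(dim=0)   # mean-pool frames
    video_emb = F.normalize(video_emb, dim=-1)
    return video_emb @ F.normalize(text_emb, dim=-1)

# Toy example with random stand-ins for real embeddings.
frames = torch.randn(16, 1024)
query = torch.randn(1024)
print(video_text_score(frames, query))
```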
Frequently Asked Questions
Q: What makes this model unique?
LLaVE-2B stands out for achieving state-of-the-art performance on the MMEB leaderboard while using significantly fewer parameters and less training data than competing models. Its zero-shot generalization to video tasks, despite no video-specific training, is particularly noteworthy.
Q: What are the recommended use cases?
The model is ideal for applications requiring multimodal understanding, including image-text matching, visual search systems, content recommendation, and video retrieval tasks. It's particularly effective for scenarios where efficient embedding of multiple modalities is needed.
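In a visual search setting, the image collection is typically embedded offline and queried at request time with a top-k similarity lookup. A minimal sketch with placeholder embeddings standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

# Offline: embed and normalize the image collection (random placeholders here).
index = F.normalize(torch.randn(10_000, 1024), dim=-1)   # (N, D) image embeddings

def search(query_emb: torch.Tensor, k: int = 5):
    """Return the indices and scores of the k most similar images to a query."""
    q = F.normalize(query_emb, dim=-1)
    scores = index @ q                      # (N,) cosine similarities
    top = torch.topk(scores, k)
    return top.indices, top.values

ids, scores = search(torch.randn(1024))     # hypothetical text query embedding
print(ids, scores)
```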