LLaVE-2B

Maintained By
zhibinlan

Parameter Count: 2 Billion
Context Window: 4K tokens
Base Model: Aquila-VL-2B
Training Data: MMEB-train (662K pairs)
Paper: arXiv:2503.04812

What is LLaVE-2B?

LLaVE-2B is a state-of-the-art multimodal embedding model designed to process and understand relationships between text, images, and videos. Built upon the Aquila-VL-2B architecture, it achieves remarkable performance in multimodal embedding tasks while using relatively modest computational resources and training data.
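For orientation, the sketch below shows one plausible way to load the checkpoint through the Hugging Face transformers loaders. The repository id zhibinlan/LLaVE-2B and the use of AutoModel/AutoProcessor with trust_remote_code=True are assumptions rather than confirmed usage; the official repository may ship its own loading and embedding-extraction utilities, which should take precedence.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Assumed Hugging Face repository id (maintainer "zhibinlan" + model name);
# verify against the actual Hub listing before use.
MODEL_ID = "zhibinlan/LLaVE-2B"

# Generic loading pattern; trust_remote_code lets the checkpoint's own model
# and processor classes be used if the repository defines them.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```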

Implementation Details

The model is implemented using PyTorch and trained on 8 NVIDIA A100 GPUs. It utilizes a hardness-weighted contrastive learning approach and achieves top rankings on the MMEB leaderboard with only 662K training pairs.

  • Efficient architecture with 2B parameters
  • 4K token context window for comprehensive understanding
  • Zero-shot capabilities for text-video retrieval
  • Trained using Huggingface Trainer framework
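To make the training objective concrete, here is a minimal PyTorch sketch of a hardness-weighted contrastive (InfoNCE) loss over in-batch negatives. The specific weighting scheme shown here, with a hypothetical beta coefficient that up-weights negatives in proportion to their similarity to the query, is a simplified illustration of the general idea, not the exact formulation from the LLaVE paper (arXiv:2503.04812).

```python
import torch
import torch.nn.functional as F


def hardness_weighted_contrastive_loss(query_emb, target_emb, temperature=0.05, beta=0.5):
    """Simplified hardness-weighted InfoNCE over in-batch negatives.

    query_emb, target_emb: (batch, dim) tensors where (query_emb[i], target_emb[i])
    are the positive pairs. beta is a hypothetical coefficient controlling how
    strongly hard (high-similarity) negatives are up-weighted.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    sim = q @ t.T                  # (batch, batch) cosine similarities
    logits = sim / temperature

    # Hardness weights: negatives that score closer to the query get larger weight;
    # the positive on the diagonal keeps weight 1 so the numerator is unchanged.
    with torch.no_grad():
        weights = torch.exp(beta * sim)
        weights.fill_diagonal_(1.0)

    # Adding log(weights) to the logits multiplies each exponentiated logit by its
    # weight inside the softmax denominator.
    weighted_logits = logits + torch.log(weights)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(weighted_logits, labels)


# Toy usage with random embeddings standing in for model outputs.
loss = hardness_weighted_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```

Up-weighting hard negatives forces the model to separate near-miss pairs more aggressively, which is the stated motivation for the hardness-weighted objective.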

Core Capabilities

  • Text-to-image embedding and retrieval
  • Image-to-text embedding and matching
  • Multi-image processing
  • Video understanding and embedding
  • Zero-shot generalization to new tasks
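The retrieval-style capabilities above all reduce to nearest-neighbor search in a shared embedding space. The sketch below ranks a gallery of image embeddings against a text query embedding by cosine similarity; random tensors stand in for the embeddings the model would actually produce.

```python
import torch
import torch.nn.functional as F

# Random tensors standing in for embeddings the model would produce: one text
# query embedding and a small gallery of image embeddings (illustration only).
query_embedding = torch.randn(1, 256)
image_embeddings = torch.randn(1000, 256)

# Cosine similarity = dot product of L2-normalized vectors.
query = F.normalize(query_embedding, dim=-1)
gallery = F.normalize(image_embeddings, dim=-1)
scores = (query @ gallery.T).squeeze(0)   # (1000,) similarity scores

# Indices of the five most similar gallery images for this query.
top_scores, top_indices = scores.topk(5)
print(top_indices.tolist())
```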

Frequently Asked Questions

Q: What makes this model unique?

LLaVE-2B stands out for its ability to achieve state-of-the-art performance on the MMEB leaderboard while using significantly fewer parameters and training data than competitors. Its zero-shot generalization to video tasks without specific video training is particularly noteworthy.

Q: What are the recommended use cases?

The model is ideal for applications requiring multimodal understanding, including image-text matching, visual search systems, content recommendation, and video retrieval tasks. It's particularly effective for scenarios where efficient embedding of multiple modalities is needed.
