LLaVE-2B

zhibinlan

LLaVE-2B is a 2B-parameter multimodal embedding model based on Aquila-VL-2B. It produces embeddings for text, images, and video and has a 4K-token context window.

Property: Value
Parameter Count: 2 billion
Context Window: 4K tokens
Base Model: Aquila-VL-2B
Training Data: MMEB-train (662K pairs)
Paper: arXiv:2503.04812

What is LLaVE-2B?

LLaVE-2B is a multimodal embedding model designed to capture relationships between text, images, and videos. Built on the Aquila-VL-2B architecture, it achieves top rankings on the MMEB leaderboard while using relatively modest computational resources and training data (2B parameters, 662K training pairs).

Implementation Details

The model is implemented in PyTorch and was trained on 8 NVIDIA A100 GPUs. It uses hardness-weighted contrastive learning, which gives more weight to hard negatives (non-matching pairs that are difficult to distinguish from the true match), and achieves top rankings on the MMEB leaderboard with only 662K training pairs.

  • Efficient architecture with 2B parameters
  • 4K-token context window
  • Zero-shot capabilities for text-video retrieval
  • Trained using the Hugging Face Trainer framework
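
To make the idea of hardness-weighted contrastive learning more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the weighting form exp(alpha * similarity), the names `alpha` and `temperature`, and their default values are assumptions chosen for illustration. The sketch takes a square similarity matrix in which the diagonal holds the matching (positive) pairs, and it scales each negative's contribution by a weight that grows with its similarity to the query. With `alpha=0` it reduces to the standard InfoNCE loss.

```python
import math


def hardness_weighted_info_nce(sim, temperature=0.05, alpha=2.0):
    """Illustrative hardness-weighted contrastive loss (hypothetical form).

    sim[i][j] is the similarity between query i and candidate j.
    Diagonal entries sim[i][i] are the positive pairs; all others are negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / temperature)
        neg = 0.0
        for j in range(n):
            if j == i:
                continue
            # Assumed weighting: negatives that look more similar to the
            # query (harder negatives) receive a larger weight.
            weight = math.exp(alpha * sim[i][j])
            neg += weight * math.exp(sim[i][j] / temperature)
        total += -math.log(pos / (pos + neg))
    return total / n


# Toy 2x2 similarity matrix: rows are queries, columns are candidates.
toy_sim = [[0.9, 0.1],
           [0.2, 0.8]]
plain_loss = hardness_weighted_info_nce(toy_sim, alpha=0.0)     # standard InfoNCE
weighted_loss = hardness_weighted_info_nce(toy_sim, alpha=2.0)  # hard negatives up-weighted
```

In this toy example the weighted loss is larger than the plain one, because both negatives have positive similarity and are therefore up-weighted. In a real training loop the weights would typically be treated as constants (no gradient flowing through them), which is a design choice this plain-Python sketch does not model.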

Core Capabilities

  • Text-to-image embedding and retrieval
  • Image-to-text embedding and matching
  • Multi-image processing
  • Video understanding and embedding
  • Zero-shot generalization to new tasks
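
The retrieval capabilities listed above all come down to comparing embedding vectors. The sketch below shows that comparison step only, assuming the embeddings have already been computed. The vectors are made-up placeholders, and the helper names `cosine_similarity` and `rank_candidates` are hypothetical; how to load LLaVE-2B and encode inputs is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Placeholder embeddings: in practice the query might be a text embedding
# and the candidates image or video embeddings from the same model.
query_embedding = [1.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0],   # unrelated
    [1.0, 0.1],   # close match
    [-1.0, 0.0],  # opposite direction
]
ranking = rank_candidates(query_embedding, candidate_embeddings)
```

The same ranking step applies to text-to-image, image-to-text, and text-to-video retrieval, since only the source of the embeddings changes.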

Frequently Asked Questions

Q: What makes this model unique?

LLaVE-2B stands out for achieving top rankings on the MMEB leaderboard while using significantly fewer parameters and less training data than competing models. It also generalizes zero-shot to text-video retrieval without any video-specific training.

Q: What are the recommended use cases?

The model is ideal for applications requiring multimodal understanding, including image-text matching, visual search systems, content recommendation, and video retrieval tasks. It's particularly effective for scenarios where efficient embedding of multiple modalities is needed.
