BGE-VL-MLLM-S1

BAAI

Multimodal retrieval model that achieves state-of-the-art (SOTA) performance in composed image retrieval. It is trained on the MegaPairs dataset, which contains more than 26 million triplets, and performs strongly on zero-shot tasks.

Property	Value
Author	BAAI
License	MIT License
Paper	MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Model URL	https://huggingface.co/BAAI/BGE-VL-MLLM-S1

What is BGE-VL-MLLM-S1?

BGE-VL-MLLM-S1 is a state-of-the-art multimodal retrieval model trained exclusively on the MegaPairs dataset, which contains over 26 million heterogeneous KNN triplets. The model demonstrates exceptional performance in composed image retrieval tasks, achieving an 8.1% improvement on the CIRCO benchmark (mAP@5) compared to previous SOTA models.
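
To make the mAP@5 metric concrete, here is a minimal sketch of how mean average precision at k can be computed for a set of queries. It assumes one common definition of AP@k (the precision at each rank that holds a relevant item, summed and divided by min(k, number of relevant items)); the function names and the toy rankings are hypothetical and are not taken from the CIRCO evaluation code.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """AP@k for one query under the assumed definition described above."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Walk the top-k results; add precision@rank whenever the item is relevant.
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(
    rankings: List[Sequence[str]], ground_truths: List[Set[str]], k: int = 5
) -> float:
    """Mean of AP@k over all queries."""
    scores = [average_precision_at_k(r, g, k) for r, g in zip(rankings, ground_truths)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries, each with a ranked list of candidate ids.
rankings = [["a", "x", "b", "y", "z"], ["q", "r", "s", "t", "u"]]
ground_truths = [{"a", "b"}, {"r"}]
map5 = mean_average_precision_at_k(rankings, ground_truths, k=5)
```

In this toy case the first query has relevant items at ranks 1 and 3 and the second has one at rank 2, so a retriever that pushes relevant images higher in the top five raises the score.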

Implementation Details

The model accepts both image and text inputs, which suits it to composed image retrieval, where a query combines a reference image with text describing the desired change. The implementation processes queries and candidates in separate modes, and it includes built-in embedding generation and normalization.

Built-in processor for handling both image and text inputs
CUDA-compatible for enhanced performance
Supports batch processing of multiple images
Generates normalized embeddings for accurate similarity matching
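
The role of normalized embeddings in similarity matching can be sketched without loading the model. The example below is a minimal illustration, assuming embeddings are plain lists of floats: once vectors are scaled to unit length, their dot product equals cosine similarity, and candidates can be ranked by that score. The helper names and the two-dimensional toy vectors are hypothetical; real embeddings from the model are much higher-dimensional.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_scores(query: Sequence[float], candidates: List[Sequence[float]]) -> List[float]:
    """Dot product of unit vectors, which equals cosine similarity."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]


def rank_candidates(query: Sequence[float], candidates: List[Sequence[float]]) -> List[int]:
    """Candidate indices ordered from most to least similar."""
    scores = similarity_scores(query, candidates)
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings standing in for model outputs.
query_emb = [3.0, 4.0]
candidate_embs = [[4.0, 3.0], [3.0, 4.0], [-3.0, -4.0]]
scores = similarity_scores(query_emb, candidate_embs)
order = rank_candidates(query_emb, candidate_embs)
```

Normalizing first means the scores depend only on the direction of each embedding, not its magnitude, so scores for different candidates are directly comparable.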

Core Capabilities

Zero-shot composed image retrieval with SOTA performance
Efficient processing of multimodal inputs
Robust generalization across various retrieval tasks
Advanced similarity scoring between queries and candidates

Frequently Asked Questions

Q: What makes this model unique?

BGE-VL-MLLM-S1 stands out for its training on the MegaPairs dataset and for reaching state-of-the-art performance in composed image retrieval zero-shot, without additional fine-tuning. It does so while keeping a relatively efficient architecture.

Q: What are the recommended use cases?

The model is best suited to multimodal retrieval, particularly composed image retrieval, where text and image inputs must be processed together. Typical applications include visual search engines, content recommendation systems, and image-text matching.