BGE-VL-MLLM-S2
| Property | Value |
|---|---|
| Developer | BAAI |
| License | MIT License |
| Paper | MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval |
| Release Date | March 2025 |
What is BGE-VL-MLLM-S2?
BGE-VL-MLLM-S2 is a multimodal retrieval model that extends its predecessor, BGE-VL-MLLM-S1, with additional fine-tuning on the MMEB benchmark training set. It is trained on the MegaPairs dataset, which contains over 26 million heterogeneous KNN triplets synthesized for universal multimodal retrieval.
Implementation Details
The model is loaded through the Transformers library and accepts both image and text inputs. Queries and candidates are encoded into a shared embedding space, and retrieval is performed by comparing normalized embeddings.
- Built on the MegaPairs dataset with additional MMEB fine-tuning
- Supports both image-to-image and text-image-to-image retrieval
- Implements normalized embeddings for accurate similarity scoring
- Provides a query/candidate processing API (see the usage sketch after this list)
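The snippet below is a minimal usage sketch following the pattern published with the BGE-VL model cards. The `set_processor` and `data_process` helpers are supplied by the model's remote code (hence `trust_remote_code=True`); the image paths and the modification text are placeholders.

```python
import torch
from transformers import AutoModel

MODEL_NAME = "BAAI/BGE-VL-MLLM-S2"

# The model ships custom embedding helpers via remote code.
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()
model.cuda()

with torch.no_grad():
    model.set_processor(MODEL_NAME)

    # Composed query: a reference image plus a textual modification instruction.
    query_inputs = model.data_process(
        text="Make the background dark, as if the photo was taken at night",
        images="./query_image.png",   # placeholder path
        q_or_c="q",                   # mark as query
        task_instruction=(
            "Retrieve the target image that best meets the combined criteria "
            "by using both the provided image and the image retrieval instructions: "
        ),
    )

    # Candidate images to score against the query.
    candidate_inputs = model.data_process(
        images=["./candidate_1.png", "./candidate_2.png"],  # placeholder paths
        q_or_c="c",                   # mark as candidates
    )

    # Take the last-token hidden state as the embedding, then L2-normalize.
    query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
    candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]
    query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
    candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

    # Cosine similarity between the query and each candidate.
    scores = torch.matmul(query_embs, candi_embs.T)
print(scores)
```

Because the embeddings are L2-normalized, the dot product equals cosine similarity, so higher scores indicate better matches.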
Core Capabilities
- State-of-the-art performance in composed image retrieval
- Enhanced multimodal embedding across diverse tasks
- Reported 8.1% improvement over the previous state of the art on the CIRCO benchmark (zero-shot composed image retrieval)
- Superior performance on MMEB out-of-distribution tasks
- Efficient processing of both images and text queries
Frequently Asked Questions
Q: What makes this model unique?
BGE-VL-MLLM-S2 stands out for its superior performance in multimodal retrieval tasks, achieving state-of-the-art results while maintaining efficient parameter usage. It's particularly notable for its additional fine-tuning on MMEB, which enhances its performance across a broader range of multimodal embedding tasks.
Q: What are the recommended use cases?
The model excels in composed image retrieval, multimodal search applications, and general-purpose image-text embedding tasks. It's particularly well-suited for applications requiring precise image retrieval based on both visual and textual queries.
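For search-style deployments, candidate embeddings can be computed once and cached; at query time, ranking then reduces to a matrix multiply over normalized vectors followed by top-k selection. The sketch below assumes embeddings were produced as in the earlier snippet; the shapes (10,000 candidates, 4096-dimensional vectors) are placeholders.

```python
import torch

# Hypothetical precomputed, L2-normalized embeddings:
# gallery_embs: (num_candidates, dim), query_embs: (num_queries, dim)
gallery_embs = torch.nn.functional.normalize(torch.randn(10_000, 4096), dim=-1)
query_embs = torch.nn.functional.normalize(torch.randn(4, 4096), dim=-1)

# Cosine similarity of every query against every candidate.
scores = query_embs @ gallery_embs.T  # (num_queries, num_candidates)

# Indices of the 5 best-matching candidates per query.
top_scores, top_idx = torch.topk(scores, k=5, dim=-1)
print(top_idx)
```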