GME-Qwen2-VL-2B-Instruct
| Property | Value |
|---|---|
| Model Size | 2.21B parameters |
| Embedding Dimension | 1536 |
| Max Sequence Length | 32768 |
| Developer | Alibaba-NLP (Tongyi Lab) |
| Paper | GME: Improving Universal Multimodal Retrieval by Multimodal LLMs |
What is gme-Qwen2-VL-2B-Instruct?
GME-Qwen2-VL-2B-Instruct is a multimodal embedding model developed by Alibaba's Tongyi Lab. It accepts text, images, and image-text pairs and maps them into a single shared vector space, which makes retrieval across modalities possible (text-to-image, image-to-text, image-to-image, and so on).
Implementation Details
The model is built on the Qwen2-VL architecture and supports dynamic-resolution image input, processing up to 1024 visual tokens per image. Embeddings are 1536-dimensional. Reported scores are 65.27 on MTEB (English), 66.92 on MTEB (Chinese), and 64.45 on the UMRB benchmark.
- Supports three input types: text, image, and image-text pairs (illustrated in the sketch after this list)
- Enables Any2Any Search capabilities across modalities
- Features dynamic image resolution support
- Implements efficient visual token processing
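The snippet below is a minimal usage sketch covering all three input types. It assumes the GmeQwen2VL helper class (gme_inference.py) distributed with the model repository; the method names get_text_embeddings, get_image_embeddings, and get_fused_embeddings follow that helper and should be checked against the current model card. The image paths are placeholders.

```python
# Minimal sketch: embedding the three supported input types.
# Assumes the GmeQwen2VL helper (gme_inference.py) shipped with the model
# repository; verify class and method names against the current model card.
from gme_inference import GmeQwen2VL

model = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = [
    "What is the capital of China?",
    "An overview figure of the Qwen2-VL architecture.",
]
images = [
    "figures/page_1.png",   # placeholder local paths or URLs
    "figures/page_2.png",
]

# Text-only and image-only embeddings (1536-dimensional).
text_emb = model.get_text_embeddings(texts=texts)
image_emb = model.get_image_embeddings(images=images)

# Fused embeddings for image-text pair inputs.
fused_emb = model.get_fused_embeddings(texts=texts, images=images)

# Any2Any similarity: a dot product, assuming the returned embeddings
# are L2-normalized as the model card's similarity examples suggest.
print((text_emb * image_emb).sum(-1))
```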
Core Capabilities
- Universal multimodal retrieval across text and images (see the ranking sketch after this list)
- Strong performance in visual document retrieval tasks
- Excellent support for multimodal RAG applications
- State-of-the-art results on the Universal Multimodal Retrieval Benchmark (UMRB)
- Enhanced document understanding for academic papers
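Because every modality lands in the same 1536-dimensional space, cross-modal (Any2Any) retrieval reduces to nearest-neighbour search over precomputed vectors. The sketch below ranks a corpus of image or fused embeddings against a single query embedding by cosine similarity; it is independent of the model API and uses random stand-in vectors that would be replaced with real GME embeddings.

```python
import torch
import torch.nn.functional as F

def rank_corpus(query_emb: torch.Tensor, corpus_emb: torch.Tensor, top_k: int = 5):
    """Rank corpus items by cosine similarity to a single query embedding.

    query_emb:  shape (1536,)   - e.g. a text query embedded with GME
    corpus_emb: shape (N, 1536) - e.g. image or fused document embeddings
    """
    query_emb = F.normalize(query_emb.unsqueeze(0), dim=-1)
    corpus_emb = F.normalize(corpus_emb, dim=-1)
    scores = (corpus_emb @ query_emb.T).squeeze(-1)   # cosine similarities
    top = torch.topk(scores, k=min(top_k, scores.numel()))
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Toy usage with random stand-in vectors (replace with real GME embeddings).
query = torch.randn(1536)
corpus = torch.randn(100, 1536)
for idx, score in rank_corpus(query, corpus, top_k=3):
    print(f"doc {idx}: cosine={score:.3f}")
```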
Frequently Asked Questions
Q: What makes this model unique?
A: The model's ability to generate unified vector representations for different modalities, combined with its state-of-the-art performance in multimodal retrieval tasks, sets it apart. It's particularly notable for its strong visual document retrieval capabilities and support for dynamic image resolutions.
Q: What are the recommended use cases?
A: The model excels in multimodal retrieval tasks, document understanding, and multimodal RAG applications. It's particularly well-suited for academic paper analysis, cross-modal search operations, and applications requiring sophisticated document understanding capabilities.
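As a rough illustration of the multimodal RAG use case, the sketch below indexes page screenshots of a paper with fused image-text embeddings and retrieves the most relevant pages for a question before they are handed to a generator. The GmeQwen2VL helper is assumed from the model repository, and the file paths and captions are placeholders for illustration, not part of the model card.

```python
# Hypothetical multimodal-RAG retrieval step over scanned paper pages.
# GmeQwen2VL and its methods are assumed from the model repository;
# the page paths and captions are placeholders.
import torch
from gme_inference import GmeQwen2VL

model = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

pages = ["paper/page_1.png", "paper/page_2.png", "paper/page_3.png"]
captions = ["Abstract and introduction", "Method overview figure", "Benchmark results table"]

# Index: one fused embedding per page (screenshot + short caption).
page_emb = model.get_fused_embeddings(texts=captions, images=pages)

# Query: embed the user question as text.
question = "Which benchmark does the paper report state-of-the-art results on?"
q_emb = model.get_text_embeddings(texts=[question])

# Retrieve the top-2 pages; a downstream generator would consume them.
scores = (page_emb @ q_emb.T).squeeze(-1)
top_pages = [pages[i] for i in torch.topk(scores, k=2).indices.tolist()]
print("Context pages for the generator:", top_pages)
```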