GME-Qwen2-VL-2B-Instruct
| Property | Value |
|---|---|
| Model Size | 2.21B parameters |
| Embedding Dimension | 1536 |
| Max Sequence Length | 32768 |
| Developer | Alibaba-NLP (Tongyi Lab) |
| Paper | GME: Improving Universal Multimodal Retrieval by Multimodal LLMs |
What is gme-Qwen2-VL-2B-Instruct?
GME-Qwen2-VL-2B-Instruct is a multimodal embedding model developed by Alibaba's Tongyi Lab. It accepts text, images, and image-text pairs and maps them into a single shared vector space, which makes retrieval across modalities possible (text-to-image, image-to-text, image-to-image, and so on).
Implementation Details
The model is built on the Qwen2-VL architecture and supports dynamic-resolution image input, processing up to 1024 visual tokens per image. Embeddings are 1536-dimensional. Reported scores are 65.27 on MTEB (English), 66.92 on MTEB (Chinese), and 64.45 on the UMRB benchmark.
- Supports three input types: text, image, and image-text pairs (illustrated in the sketch after this list)
- Enables Any2Any Search capabilities across modalities
- Features dynamic image resolution support
- Implements efficient visual token processing
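The snippet below is a minimal usage sketch covering all three input types. It assumes the GmeQwen2VL helper class (gme_inference.py) distributed with the model repository; the method names get_text_embeddings, get_image_embeddings, and get_fused_embeddings follow that helper and should be checked against the current model card. The image paths are placeholders.

```python
# Minimal sketch: embedding the three supported input types.
# Assumes the GmeQwen2VL helper (gme_inference.py) shipped with the model
# repository; verify class and method names against the current model card.
from gme_inference import GmeQwen2VL

model = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = [
    "What is the capital of China?",
    "An overview figure of the Qwen2-VL architecture.",
]
images = [
    "figures/page_1.png",   # placeholder local paths or URLs
    "figures/page_2.png",
]

# Text-only and image-only embeddings (1536-dimensional).
text_emb = model.get_text_embeddings(texts=texts)
image_emb = model.get_image_embeddings(images=images)

# Fused embeddings for image-text pair inputs.
fused_emb = model.get_fused_embeddings(texts=texts, images=images)

# Any2Any similarity: a dot product, assuming the returned embeddings
# are L2-normalized as the model card's similarity examples suggest.
print((text_emb * image_emb).sum(-1))
```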
Core Capabilities
- Universal multimodal retrieval across text and images (see the ranking sketch after this list)
- Strong performance in visual document retrieval tasks
- Excellent support for multimodal RAG applications
- State-of-the-art results on the Universal Multimodal Retrieval Benchmark (UMRB)
- Enhanced document understanding for academic papers
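Because every modality lands in the same 1536-dimensional space, cross-modal (Any2Any) retrieval reduces to nearest-neighbour search over precomputed vectors. The sketch below ranks a corpus of image or fused embeddings against a single query embedding by cosine similarity; it is independent of the model API and uses random stand-in vectors that would be replaced with real GME embeddings.

```python
import torch
import torch.nn.functional as F

def rank_corpus(query_emb: torch.Tensor, corpus_emb: torch.Tensor, top_k: int = 5):
    """Rank corpus items by cosine similarity to a single query embedding.

    query_emb:  shape (1536,)   - e.g. a text query embedded with GME
    corpus_emb: shape (N, 1536) - e.g. image or fused document embeddings
    """
    query_emb = F.normalize(query_emb.unsqueeze(0), dim=-1)
    corpus_emb = F.normalize(corpus_emb, dim=-1)
    scores = (corpus_emb @ query_emb.T).squeeze(-1)   # cosine similarities
    top = torch.topk(scores, k=min(top_k, scores.numel()))
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Toy usage with random stand-in vectors (replace with real GME embeddings).
query = torch.randn(1536)
corpus = torch.randn(100, 1536)
for idx, score in rank_corpus(query, corpus, top_k=3):
    print(f"doc {idx}: cosine={score:.3f}")
```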
Frequently Asked Questions
Q: What makes this model unique?
A: The model's ability to generate unified vector representations for different modalities, combined with its state-of-the-art performance in multimodal retrieval tasks, sets it apart. It's particularly notable for its strong visual document retrieval capabilities and support for dynamic image resolutions.
Q: What are the recommended use cases?
A: The model excels in multimodal retrieval tasks, document understanding, and multimodal RAG applications. It's particularly well-suited for academic paper analysis, cross-modal search operations, and applications requiring sophisticated document understanding capabilities.
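As a rough illustration of the multimodal RAG use case, the sketch below indexes page screenshots of a paper with fused image-text embeddings and retrieves the most relevant pages for a question before they are handed to a generator. The GmeQwen2VL helper is assumed from the model repository, and the file paths and captions are placeholders for illustration, not part of the model card.

```python
# Hypothetical multimodal-RAG retrieval step over scanned paper pages.
# GmeQwen2VL and its methods are assumed from the model repository;
# the page paths and captions are placeholders.
import torch
from gme_inference import GmeQwen2VL

model = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

pages = ["paper/page_1.png", "paper/page_2.png", "paper/page_3.png"]
captions = ["Abstract and introduction", "Method overview figure", "Benchmark results table"]

# Index: one fused embedding per page (screenshot + short caption).
page_emb = model.get_fused_embeddings(texts=captions, images=pages)

# Query: embed the user question as text.
question = "Which benchmark does the paper report state-of-the-art results on?"
q_emb = model.get_text_embeddings(texts=[question])

# Retrieve the top-2 pages; a downstream generator would consume them.
scores = (page_emb @ q_emb.T).squeeze(-1)
top_pages = [pages[i] for i in torch.topk(scores, k=2).indices.tolist()]
print("Context pages for the generator:", top_pages)
```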