GME-Qwen2-VL-7B-Instruct
| Property | Value |
|---|---|
| Parameter Count | 8.29B |
| Maximum Sequence Length | 32,768 tokens |
| Embedding Dimension | 3,584 |
| Developer | Alibaba-NLP (Tongyi Lab) |
| Paper | GME: Improving Universal Multimodal Retrieval by Multimodal LLMs |
What is GME-Qwen2-VL-7B-Instruct?
GME-Qwen2-VL-7B-Instruct is a unified multimodal embedding model developed by Alibaba's Tongyi Lab. It encodes text, images, and image-text pairs into a shared vector space, producing universal representations for retrieval across modalities. The model reports scores of 67.48 on MTEB-en, 71.36 on MTEB-zh, and 67.44 on the UMRB benchmark.
Implementation Details
Built on the Qwen2-VL architecture, the model accepts images at dynamic resolutions and processes up to 1,024 visual tokens per image. It produces 3,584-dimensional embeddings and handles input sequences of up to 32,768 tokens. Key features:
- Unified multimodal representation supporting Any2Any search capabilities
- Dynamic image resolution processing
- Enhanced visual document retrieval performance
- Support for complex document understanding scenarios
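As a hedged illustration, the snippet below sketches how text embeddings might be obtained with the Hugging Face `transformers` library by loading the checkpoint as a standard Qwen2-VL model and pooling the last token's final hidden state into a normalized 3,584-dimensional vector. The repository id `Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`, the pooling strategy, and the `embed_text` helper are assumptions made for illustration; the official model card ships its own inference wrapper, so treat this as a sketch rather than the reference implementation.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumed Hugging Face repository id for the checkpoint.
MODEL_ID = "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def embed_text(text: str) -> torch.Tensor:
    """Encode a text string into a single L2-normalized embedding (assumed pooling)."""
    inputs = processor(text=[text], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Last-token pooling over the final hidden layer (3,584 dimensions).
    emb = outputs.hidden_states[-1][:, -1, :]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)

query = embed_text("Which model supports dynamic image resolution?")
print(query.shape)  # torch.Size([1, 3584])
```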
Core Capabilities
- Text-to-text, image-to-image, and cross-modal retrieval
- Strong performance in visual document retrieval tasks
- Multimodal retrieval-augmented generation (RAG) applications
- Universal vector representation generation for varied input types
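To sketch the cross-modal case, the example below adds a hypothetical `embed_image` counterpart to the `embed_text` helper above: the image is wrapped in the Qwen2-VL chat template so the processor inserts the vision placeholder tokens, and the same assumed last-token pooling maps it into the shared embedding space. Because the vectors are L2-normalized, a dot product gives the cosine similarity used for ranking. The file name and pooling choice are assumptions, not part of the official usage.

```python
from PIL import Image

def embed_image(image: Image.Image) -> torch.Tensor:
    """Encode an image into the same 3,584-dimensional space as text (assumed pooling)."""
    # The chat template inserts the vision placeholder tokens that the
    # processor expands to match the image's visual-token count.
    messages = [{"role": "user", "content": [{"type": "image"}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    emb = outputs.hidden_states[-1][:, -1, :]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)

# Text-to-image retrieval: score one (hypothetical) image against a text query.
image_vec = embed_image(Image.open("paper_figure.png"))
query_vec = embed_text("architecture diagram of a multimodal retriever")
print((query_vec @ image_vec.T).item())  # cosine similarity
```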
Frequently Asked Questions
Q: What makes this model unique?
Its ability to encode text, images, and image-text pairs into a single shared vector space sets it apart: one model covers text-to-text, text-to-image, image-to-text, and image-to-image retrieval, which is what makes universal (Any2Any) multimodal search practical. Its scores on MTEB and UMRB, listed above, are competitive with or better than many single-modality embedding models.
Q: What are the recommended use cases?
The model excels in academic paper analysis, multimodal RAG applications, and complex document understanding scenarios. It's particularly suited for applications requiring sophisticated cross-modal search and retrieval capabilities.
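For the multimodal RAG use case, the rough sketch below reuses the hypothetical `embed_text` and `embed_image` helpers from the examples above to rank a small mixed corpus of passages and images against a query and place the top hit into a generation prompt; the corpus contents and prompt format are illustrative only.

```python
# Hypothetical mixed corpus: text passages and images, all embedded into the
# same vector space by the helpers sketched earlier.
corpus = [
    ("text", "GME produces unified embeddings for text and images."),
    ("text", "Qwen2-VL processes images at dynamic resolutions."),
    ("image", "slide_03.png"),
]
vectors = torch.cat([
    embed_text(item) if kind == "text" else embed_image(Image.open(item))
    for kind, item in corpus
])

question = "How does the model handle images of different sizes?"
scores = embed_text(question) @ vectors.T   # cosine scores, shape (1, len(corpus))
top = scores.argmax(dim=-1).item()

# The retrieved evidence would then be handed to a generator model.
kind, item = corpus[top]
prompt = f"Answer using the retrieved {kind} evidence: {item}\n\nQuestion: {question}"
print(prompt)
```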