GME-Qwen2-VL-7B-Instruct

  • Parameter Count: 8.29B
  • Maximum Sequence Length: 32,768 tokens
  • Embedding Dimension: 3,584
  • Developer: Alibaba-NLP (Tongyi Lab)
  • Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

What is GME-Qwen2-VL-7B-Instruct?

GME-Qwen2-VL-7B-Instruct is a state-of-the-art unified multimodal embedding model developed by Alibaba's Tongyi Lab. It processes text, images, and image-text pairs and maps them all into a single shared vector space, making any combination of modalities directly comparable. The model scores 67.48 on MTEB-en, 71.36 on MTEB-zh, and 67.44 on the UMRB benchmark.

Implementation Details

Built on the Qwen2-VL architecture, the model supports dynamic-resolution image input with up to 1,024 visual tokens per image, produces 3,584-dimensional embeddings, and handles sequences of up to 32,768 tokens. A usage sketch follows the feature list below.

  • Unified multimodal representation supporting Any2Any search capabilities
  • Dynamic image resolution processing
  • Enhanced visual document retrieval performance
  • Support for complex document understanding scenarios
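
The sketch below shows how embeddings might be generated for each input type. It assumes the `gme_inference.py` helper script distributed with the model repository; the `GmeQwen2VL` class name and its `get_*_embeddings` methods are assumptions based on that script, so check the model card for the exact interface.

```python
from gme_inference import GmeQwen2VL  # helper script assumed to ship with the model repo

# Download and load the model from the Hugging Face Hub.
gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")

texts = ["What is multimodal retrieval?"]
images = ["scanned_page.png"]  # hypothetical local image file

# Each call is assumed to return a tensor of shape (batch, 3584).
text_emb = gme.get_text_embeddings(texts=texts)
image_emb = gme.get_image_embeddings(images=images)

# An image-text pair can also be fused into a single vector.
fused_emb = gme.get_fused_embeddings(texts=texts, images=images)
```

Because all three input types land in the same 3,584-dimensional space, any pair of resulting vectors can be compared directly.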

Core Capabilities

  • Text-to-text, image-to-image, and cross-modal retrieval (see the sketch after this list)
  • Strong performance in visual document retrieval tasks
  • Multimodal retrieval-augmented generation (RAG) applications
  • Universal vector representation generation for varied input types
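
As a concrete illustration of cross-modal retrieval, the sketch below ranks candidate images against a text query. It continues the hedged `GmeQwen2VL` assumption from above, and additionally assumes the returned embeddings are L2-normalized, so a dot product equals cosine similarity; the file names are illustrative.

```python
import torch
from gme_inference import GmeQwen2VL  # same assumed helper as above

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")

query = ["a cat sitting on a sofa"]
candidates = ["cat.jpg", "dog.jpg", "car.jpg"]  # hypothetical image files

q_emb = gme.get_text_embeddings(texts=query)            # (1, 3584)
img_embs = gme.get_image_embeddings(images=candidates)  # (3, 3584)

# With normalized vectors, the dot product is the cosine similarity.
scores = q_emb @ img_embs.T                             # (1, 3)
best = scores.argmax(dim=-1).item()
print(candidates[best], scores.tolist())
```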

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process multiple input types (text, images, and image-text pairs) and produce unified vector representations sets it apart, making it particularly valuable for universal multimodal retrieval tasks. Its benchmark results exceed those of many existing embedding models.

Q: What are the recommended use cases?

The model excels in academic paper analysis, multimodal RAG applications, and complex document understanding scenarios. It's particularly suited for applications requiring sophisticated cross-modal search and retrieval capabilities.
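
Below is a sketch of the retrieval stage of such a multimodal RAG pipeline, under the same `GmeQwen2VL` assumption as the earlier examples; the file names and the value of `k` are illustrative.

```python
import torch
from gme_inference import GmeQwen2VL  # same assumed helper as above

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")

# Offline: embed each scanned page of a document collection once.
pages = ["paper_p1.png", "paper_p2.png", "paper_p3.png"]  # hypothetical files
page_index = gme.get_image_embeddings(images=pages)  # (num_pages, 3584)

# Online: embed the user question and retrieve the top-k pages.
question = ["Which table reports the UMRB results?"]
q = gme.get_text_embeddings(texts=question)          # (1, 3584)

scores = (q @ page_index.T).squeeze(0)
top_k = torch.topk(scores, k=2).indices.tolist()
retrieved = [pages[i] for i in top_k]

# The retrieved page images would then be passed to a vision-language
# generator together with the question to produce a grounded answer.
print(retrieved)
```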
