gme-Qwen2-VL-2B-Instruct

  • Model Size: 2.21B parameters
  • Embedding Dimension: 1536
  • Max Sequence Length: 32768 tokens
  • Developer: Alibaba-NLP (Tongyi Lab)
  • Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

What is gme-Qwen2-VL-2B-Instruct?

GME-Qwen2-VL-2B-Instruct is a multimodal embedding model for unified multimodal representation learning. Built by Alibaba's Tongyi Lab, it maps text, images, and image-text pairs into a single vector space, enabling retrieval across different modalities.
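
The snippet below is a minimal sketch of how embeddings for the three input types might be produced. It assumes the gme_inference.py wrapper (class GmeQwen2VL) distributed with the Hugging Face model repository; the method names follow the published usage example, so check the model card for the exact interface.

```python
# Minimal usage sketch. Assumes the gme_inference.py helper shipped in the
# model repository; class and method names follow the published usage example
# and may differ from the current release.
from gme_inference import GmeQwen2VL

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck.",
]
images = [
    "https://example.com/cybertruck.jpg",  # placeholder URLs; local paths also work
    "https://example.com/cybertruck_cabin.jpg",
]

# One embedding per input, all in the same 1536-dimensional space.
text_embeddings = gme.get_text_embeddings(texts=texts)
image_embeddings = gme.get_image_embeddings(images=images)

# Fused embeddings for image-text pairs.
fused_embeddings = gme.get_fused_embeddings(texts=texts, images=images)

print(text_embeddings.shape, image_embeddings.shape, fused_embeddings.shape)
```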

Implementation Details

The model builds on the Qwen2-VL architecture and supports dynamic-resolution image input, processing up to 1024 visual tokens per image and producing embeddings in a 1536-dimensional space. It scores 65.27 on MTEB-en, 66.92 on MTEB-zh, and 64.45 on UMRB. A cross-modal similarity sketch follows the feature list below.

  • Supports three input types: text, image, and image-text pairs
  • Enables Any2Any Search capabilities across modalities
  • Features dynamic image resolution support
  • Implements efficient visual token processing
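
As a concrete illustration of the shared embedding space described above, here is a short sketch of cross-modal (text-to-image) scoring. It reuses the assumed GmeQwen2VL wrapper from the previous example; the optional instruction argument mirrors the published usage example, and the vectors are assumed to be unit-normalized so a dot product equals cosine similarity.

```python
# Sketch: Any2Any scoring of a text query against image embeddings.
# Assumes the GmeQwen2VL wrapper from the model repository (see the sketch above)
# and that returned embeddings are unit-normalized torch tensors.
import torch
from gme_inference import GmeQwen2VL

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

image_embeddings = gme.get_image_embeddings(
    images=["page_1.png", "page_2.png"]  # placeholder image paths
)
query_embedding = gme.get_text_embeddings(
    texts=["a red pickup truck parked outside"],
    # Optional retrieval instruction; wording taken from the published example.
    instruction="Find an image that matches the given text.",
)

assert query_embedding.shape[-1] == 1536       # embedding dimension noted above

scores = query_embedding @ image_embeddings.T  # cosine similarity (unit-normalized)
print(scores, torch.argmax(scores, dim=-1))
```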

Core Capabilities

  • Universal multimodal retrieval across text and images
  • Strong performance in visual document retrieval tasks
  • Strong support for multimodal RAG applications (a retrieval sketch follows this list)
  • State-of-the-art results on the Universal Multimodal Retrieval Benchmark (UMRB)
  • Enhanced document understanding for academic papers
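
To make the retrieval and RAG points above concrete, the following sketch ranks a small mixed corpus of text passages and document page images against a text query. Corpus contents and file paths are illustrative; the wrapper and method names are the same assumptions as in the earlier sketches.

```python
# Sketch: top-k Any2Any retrieval over a mixed text/image corpus,
# the core step of a multimodal RAG pipeline. Same assumed wrapper as above.
import torch
from gme_inference import GmeQwen2VL

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

passages = [
    "GME maps text, images, and image-text pairs into one embedding space.",
    "Dynamic resolution lets each image use up to 1024 visual tokens.",
]
page_images = ["paper_page_1.png", "paper_page_2.png"]  # placeholder scans

# Embed both modalities into the shared space and stack them into one index.
corpus = torch.cat(
    [gme.get_text_embeddings(texts=passages),
     gme.get_image_embeddings(images=page_images)],
    dim=0,
)

def retrieve(query: str, k: int = 2) -> list[int]:
    """Return indices of the k corpus entries most similar to the query."""
    q = gme.get_text_embeddings(texts=[query])
    scores = (q @ corpus.T).squeeze(0)  # cosine similarity (unit-normalized)
    return torch.topk(scores, k=k).indices.tolist()

print(retrieve("How many visual tokens can a single image use?"))
```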

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to generate unified vector representations for different modalities, combined with its state-of-the-art performance in multimodal retrieval tasks, sets it apart. It's particularly notable for its strong visual document retrieval capabilities and support for dynamic image resolutions.

Q: What are the recommended use cases?

The model excels in multimodal retrieval tasks, document understanding, and multimodal RAG applications. It's particularly well-suited for academic paper analysis, cross-modal search operations, and applications requiring sophisticated document understanding capabilities.
