Ovis1.6-Gemma2-9B

AIDC-AI

Ovis1.6-Gemma2-9B is a 10.2B-parameter multimodal LLM that leads the OpenCompass benchmark among open-source multimodal LLMs under 30B parameters. It pairs the Gemma2-9B language model with a SigLIP-400M vision transformer.

Property	Value
Parameter Count	10.2B
Model Type	Multimodal LLM
Architecture	Gemma2-9B + SigLIP-400M
License	Apache-2.0
Paper	arXiv:2405.20797

What is Ovis1.6-Gemma2-9B?

Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that processes text and images together. It builds on the Ovis1.5 architecture and improves on it in high-resolution image processing and multimodal understanding. The model integrates the Gemma2-9B language model with a SigLIP-400M vision transformer for image-text tasks.

Implementation Details

The model structurally aligns visual and textual embeddings, so that image inputs and text inputs are embedded in a comparable way before reaching the language model. It supports batch processing and handles multimodal inputs (image plus text query) of up to 8192 tokens. The implementation runs in BF16 precision, which uses half the weight memory of FP32.
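The structural alignment idea can be illustrated with a small sketch. It assumes the scheme described in the cited Ovis paper (arXiv:2405.20797): each image patch is mapped to a probability distribution over a visual vocabulary, and its embedding is the probability-weighted sum of rows from a learnable visual embedding table, mirroring how a text token indexes a text embedding table. The function names, toy sizes, and random values below are hypothetical and are not taken from the model's actual code.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_scores, visual_table):
    """Embed one patch as the probability-weighted sum of visual table rows."""
    probs = softmax(patch_scores)
    dim = len(visual_table[0])
    return [sum(p * row[d] for p, row in zip(probs, visual_table))
            for d in range(dim)]


def text_embedding(token_id, text_table):
    """Embed one text token by a plain table lookup."""
    return text_table[token_id]


# Toy sizes (hypothetical); the real tables are far larger.
random.seed(0)
VISUAL_VOCAB, TEXT_VOCAB, DIM = 6, 10, 4
visual_table = [[random.gauss(0, 1) for _ in range(DIM)]
                for _ in range(VISUAL_VOCAB)]
text_table = [[random.gauss(0, 1) for _ in range(DIM)]
              for _ in range(TEXT_VOCAB)]

# Three image patches (scores over the visual vocabulary) and three text tokens.
patch_scores = [[random.gauss(0, 1) for _ in range(VISUAL_VOCAB)]
                for _ in range(3)]
text_ids = [1, 7, 2]

# Both modalities end up as vectors of the same width in one input sequence.
sequence = ([visual_embedding(s, visual_table) for s in patch_scores]
            + [text_embedding(t, text_table) for t in text_ids])
print(len(sequence), "embeddings of width", len(sequence[0]))
```

In this sketch a sharply peaked distribution reduces to an ordinary table lookup, which is the sense in which the visual side is structured like the textual side.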

Integrated SigLIP-400M visual processor for enhanced image understanding
DPO (Direct Preference Optimization) training applied after instruction tuning
Supports high-resolution image processing
Implements efficient batch inference capabilities
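To make the BF16 point concrete, the following back-of-the-envelope estimate computes the memory needed just to hold the weights. It assumes the 10.2B parameter count from the table above and 2 bytes per BF16 value; the helper name is hypothetical, and activations, the KV cache, and framework overhead are deliberately left out.

```python
def weight_memory_bytes(param_count, bytes_per_param):
    """Memory needed to store the weights alone, in bytes."""
    return param_count * bytes_per_param


PARAMS = 10_200_000_000  # 10.2B, as listed in the property table
BF16_BYTES = 2           # bfloat16 stores each value in 16 bits
FP32_BYTES = 4           # float32, shown for comparison

bf16 = weight_memory_bytes(PARAMS, BF16_BYTES)
fp32 = weight_memory_bytes(PARAMS, FP32_BYTES)

print(f"BF16 weights: {bf16 / 1e9:.1f} GB ({bf16 / 2**30:.1f} GiB)")
print(f"FP32 weights: {fp32 / 1e9:.1f} GB ({fp32 / 2**30:.1f} GiB)")
```

Under these assumptions the weights alone take about 20.4 GB in BF16, so actual inference needs headroom beyond that figure.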

Core Capabilities

Leading performance on the OpenCompass benchmark among open-source MLLMs under 30B parameters
Efficient image-text processing and generation
High-quality multimodal understanding and response generation
Flexible deployment options with comprehensive API support

Frequently Asked Questions

Q: What makes this model unique?

The model achieves state-of-the-art results with 10.2B parameters, leading the OpenCompass benchmark among open-source MLLMs under 30B parameters. Its structural embedding alignment approach to multimodal processing also sets it apart from conventional architectures.

Q: What are the recommended use cases?

The model excels in image-text tasks including image description, visual question answering, and multimodal dialogue. It's particularly suitable for applications requiring high-quality image understanding and natural language interaction.