Ovis1.6-Gemma2-9B
Property | Value |
---|---|
Parameter Count | 10.2B |
Model Type | Multimodal LLM |
Architecture | SigLIP-400M + Gemma2-9B |
License | Apache-2.0 |
Paper | arXiv:2405.20797 |
What is Ovis1.6-Gemma2-9B?
Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that combines visual and language processing. Built as part of the Ovis1.6 series, it centers on structurally aligning visual and textual embeddings. Despite having only 10.2B parameters, the model achieves state-of-the-art results on the OpenCompass benchmark among open-source MLLMs with fewer than 30B parameters.
Implementation Details
The architecture pairs a SigLIP-400M vision encoder with a Gemma2-9B language model, further tuned with Direct Preference Optimization (DPO) after instruction tuning. It supports high-resolution image processing and is intended to run in bfloat16 (BF16) precision. A minimal inference sketch follows the feature list below.
- Enhanced high-resolution image processing
- Trained on a larger, more diverse dataset than earlier Ovis releases
- Refined training process, with DPO applied after instruction tuning
- Supports batch inference across multiple images (see the batched sketch after the Core Capabilities list)
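The following is a minimal single-image inference sketch in the spirit of the usage example published on the Hugging Face model card. The repository ID, the remote-code helpers (`get_text_tokenizer`, `get_visual_tokenizer`, `preprocess_inputs`), the `<image>` prompt placeholder, and the generation arguments are assumptions drawn from that card; check the card for the exact API of the version you install.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load in BF16; trust_remote_code pulls in the Ovis-specific modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=8192,
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Build a single-image query; <image> marks where the visual tokens are spliced in.
image = Image.open("example.jpg")
query = "<image>\nDescribe this image in detail."
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(model.device)
attention_mask = attention_mask.unsqueeze(0).to(model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=text_tokenizer.pad_token_id,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```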
Core Capabilities
- Image-text understanding with text generation
- Multimodal conversation handling
- High-performance visual reasoning
- Support for multiple input formats
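Continuing from the objects loaded in the sketch above, the snippet below illustrates how batched inference over several images might look. The per-sample preprocessing, the left-padding with the text tokenizer's pad token, and passing `pixel_values` as a list (one tensor per sample) are all assumptions; the model card's batch-inference example is the authoritative reference.

```python
# Assumes `model`, `text_tokenizer`, and `visual_tokenizer` from the previous sketch.
requests = [
    ("cat.jpg", "What animal is shown here?"),
    ("chart.png", "Summarize the main trend in this chart."),
]

samples = []
for path, question in requests:
    image = Image.open(path)
    _, input_ids, pixel_values = model.preprocess_inputs(f"<image>\n{question}", [image])
    samples.append((input_ids, pixel_values))

# Left-pad every sample to the longest prompt so they can be stacked into one batch.
pad_id = text_tokenizer.pad_token_id
max_len = max(ids.shape[-1] for ids, _ in samples)
batch_ids, batch_mask, batch_pixels = [], [], []
for ids, pix in samples:
    pad = max_len - ids.shape[-1]
    batch_ids.append(torch.cat([torch.full((pad,), pad_id, dtype=ids.dtype), ids]))
    batch_mask.append(torch.cat([torch.zeros(pad, dtype=torch.bool),
                                 torch.ones(ids.shape[-1], dtype=torch.bool)]))
    batch_pixels.append(pix.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

input_ids = torch.stack(batch_ids).to(model.device)
attention_mask = torch.stack(batch_mask).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=batch_pixels,
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=pad_id,
    )

for ids in output_ids:
    print(text_tokenizer.decode(ids, skip_special_tokens=True))
```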
Frequently Asked Questions
Q: What makes this model unique?
A: Its leading performance on the OpenCompass benchmark with only about 10B parameters, combined with its structural embedding alignment approach to multimodal processing, sets it apart from other MLLMs.
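The structural embedding alignment mentioned above can be illustrated schematically: per the Ovis paper, the visual tokenizer maps each image patch to a probability distribution over a learnable visual vocabulary, and the patch's embedding is the probability-weighted sum of rows of a visual embedding table, mirroring how text tokens index the text embedding table. The toy module below is only a conceptual sketch of that idea; the class, dimensions, and names are illustrative, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Conceptual sketch of Ovis-style probabilistic visual tokens."""
    def __init__(self, feat_dim=1152, visual_vocab=16384, embed_dim=3584):
        super().__init__()
        self.head = nn.Linear(feat_dim, visual_vocab)              # logits over the visual vocabulary
        self.visual_table = nn.Embedding(visual_vocab, embed_dim)  # learnable visual embedding table

    def forward(self, patch_features):                  # [num_patches, feat_dim] from the vision encoder
        probs = self.head(patch_features).softmax(-1)   # a probabilistic "visual token" per patch
        # Each visual embedding is a probability-weighted mix of table rows,
        # structurally analogous to a (soft) text-embedding lookup.
        return probs @ self.visual_table.weight          # [num_patches, embed_dim]

tokenizer = ToyVisualTokenizer()
patches = torch.randn(256, 1152)            # stand-in for SigLIP patch features
visual_embeddings = tokenizer(patches)      # fed to the LLM alongside text embeddings
print(visual_embeddings.shape)              # torch.Size([256, 3584])
```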
Q: What are the recommended use cases?
A: The model is particularly well-suited for tasks requiring image understanding and text generation, including image description, visual question answering, and multimodal conversation scenarios.