Ovis1.6-Gemma2-9B
Property | Value |
---|---|
Parameter Count | 10.2B |
Model Type | Multimodal LLM |
Architecture | SigLIP-400M + Gemma2-9B |
License | Apache-2.0 |
Paper | arXiv:2405.20797 |
What is Ovis1.6-Gemma2-9B?
Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that combines visual and language processing. Built as part of the Ovis1.6 series, it centers on structurally aligning visual and textual embeddings. Despite having only 10.2B parameters, the model achieves state-of-the-art results on the OpenCompass benchmark among open-source MLLMs with fewer than 30B parameters.
Implementation Details
The architecture pairs a SigLIP-400M vision encoder with a Gemma2-9B language model, further tuned with Direct Preference Optimization (DPO) after instruction tuning. It supports high-resolution image processing and is intended to run in bfloat16 (BF16) precision. A minimal inference sketch follows the feature list below.
- Enhanced high-resolution image processing
- Trained on a larger, more diverse dataset than earlier Ovis releases
- Refined training process, with DPO applied after instruction tuning
- Supports batch inference across multiple images (see the batched sketch after the Core Capabilities list)
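The following is a minimal single-image inference sketch in the spirit of the usage example published on the Hugging Face model card. The repository ID, the remote-code helpers (`get_text_tokenizer`, `get_visual_tokenizer`, `preprocess_inputs`), the `<image>` prompt placeholder, and the generation arguments are assumptions drawn from that card; check the card for the exact API of the version you install.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load in BF16; trust_remote_code pulls in the Ovis-specific modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=8192,
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Build a single-image query; <image> marks where the visual tokens are spliced in.
image = Image.open("example.jpg")
query = "<image>\nDescribe this image in detail."
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(model.device)
attention_mask = attention_mask.unsqueeze(0).to(model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=text_tokenizer.pad_token_id,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```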
Core Capabilities
- Image-text understanding with text generation
- Multimodal conversation handling
- High-performance visual reasoning
- Support for multiple input formats
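Continuing from the objects loaded in the sketch above, the snippet below illustrates how batched inference over several images might look. The per-sample preprocessing, the left-padding with the text tokenizer's pad token, and passing `pixel_values` as a list (one tensor per sample) are all assumptions; the model card's batch-inference example is the authoritative reference.

```python
# Assumes `model`, `text_tokenizer`, and `visual_tokenizer` from the previous sketch.
requests = [
    ("cat.jpg", "What animal is shown here?"),
    ("chart.png", "Summarize the main trend in this chart."),
]

samples = []
for path, question in requests:
    image = Image.open(path)
    _, input_ids, pixel_values = model.preprocess_inputs(f"<image>\n{question}", [image])
    samples.append((input_ids, pixel_values))

# Left-pad every sample to the longest prompt so they can be stacked into one batch.
pad_id = text_tokenizer.pad_token_id
max_len = max(ids.shape[-1] for ids, _ in samples)
batch_ids, batch_mask, batch_pixels = [], [], []
for ids, pix in samples:
    pad = max_len - ids.shape[-1]
    batch_ids.append(torch.cat([torch.full((pad,), pad_id, dtype=ids.dtype), ids]))
    batch_mask.append(torch.cat([torch.zeros(pad, dtype=torch.bool),
                                 torch.ones(ids.shape[-1], dtype=torch.bool)]))
    batch_pixels.append(pix.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

input_ids = torch.stack(batch_ids).to(model.device)
attention_mask = torch.stack(batch_mask).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=batch_pixels,
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=pad_id,
    )

for ids in output_ids:
    print(text_tokenizer.decode(ids, skip_special_tokens=True))
```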
Frequently Asked Questions
Q: What makes this model unique?
A: Its leading performance on the OpenCompass benchmark with only about 10B parameters, combined with its structural embedding alignment approach to multimodal processing, sets it apart from other MLLMs.
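The structural embedding alignment mentioned above can be illustrated schematically: per the Ovis paper, the visual tokenizer maps each image patch to a probability distribution over a learnable visual vocabulary, and the patch's embedding is the probability-weighted sum of rows of a visual embedding table, mirroring how text tokens index the text embedding table. The toy module below is only a conceptual sketch of that idea; the class, dimensions, and names are illustrative, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Conceptual sketch of Ovis-style probabilistic visual tokens."""
    def __init__(self, feat_dim=1152, visual_vocab=16384, embed_dim=3584):
        super().__init__()
        self.head = nn.Linear(feat_dim, visual_vocab)              # logits over the visual vocabulary
        self.visual_table = nn.Embedding(visual_vocab, embed_dim)  # learnable visual embedding table

    def forward(self, patch_features):                  # [num_patches, feat_dim] from the vision encoder
        probs = self.head(patch_features).softmax(-1)   # a probabilistic "visual token" per patch
        # Each visual embedding is a probability-weighted mix of table rows,
        # structurally analogous to a (soft) text-embedding lookup.
        return probs @ self.visual_table.weight          # [num_patches, embed_dim]

tokenizer = ToyVisualTokenizer()
patches = torch.randn(256, 1152)            # stand-in for SigLIP patch features
visual_embeddings = tokenizer(patches)      # fed to the LLM alongside text embeddings
print(visual_embeddings.shape)              # torch.Size([256, 3584])
```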
Q: What are the recommended use cases?
A: The model is particularly well-suited for tasks requiring image understanding and text generation, including image description, visual question answering, and multimodal conversation scenarios.