Ovis1.6-Gemma2-9B

Maintained By: AIDC-AI

Property          Value
Parameter Count   10.2B
Model Type        Multimodal LLM
Architecture      SigLIP-400M + Gemma2-9B
License           Apache-2.0
Paper             arXiv:2405.20797

What is Ovis1.6-Gemma2-9B?

Ovis1.6-Gemma2-9B is a multimodal large language model (MLLM) that processes both images and text. It belongs to the Ovis1.6 series, whose design centers on structurally aligning visual and textual embeddings. With 10.2B parameters, the model reports state-of-the-art performance among open-source MLLMs with up to 30B parameters.
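To make "structurally aligning visual and textual embeddings" more concrete, here is a minimal sketch. It assumes the approach described in the linked paper (arXiv:2405.20797): each image patch is turned into a probability distribution over a visual vocabulary, and its embedding is the matching weighted average of rows from a learnable visual embedding table, much as a text token looks up a row in the text embedding table. The function names, vocabulary size, and dimensions below are illustrative and are not taken from the model's code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_logits, embedding_table):
    """Probability-weighted average of embedding-table rows for one image patch.

    patch_logits:    one score per entry of the (hypothetical) visual vocabulary
    embedding_table: one embedding row per visual-vocabulary entry
    """
    probs = softmax(patch_logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy setup: a 3-entry visual vocabulary with 2-dimensional embeddings.
toy_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal probabilities, so the result is the mean of the rows.
uniform_patch = visual_embedding([0.0, 0.0, 0.0], toy_table)

# A patch that strongly prefers entry 0 ends up close to that row.
peaked_patch = visual_embedding([20.0, 0.0, 0.0], toy_table)
```

In this toy example a text token would select exactly one row of its table, while a visual token blends several rows. Both end up as vectors built from an embedding table, which is the structural parallel the sketch is meant to show.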

Implementation Details

The architecture pairs a SigLIP-400M vision encoder with a Gemma2-9B language model. After instruction tuning, the model was further trained with DPO (Direct Preference Optimization). It supports high-resolution image input and uses BF16 tensor precision.

  • Enhanced high-resolution image processing capabilities
  • Trained on a larger, more diverse dataset
  • Refined training process that adds DPO after instruction tuning
  • Supports batch inference for multiple images
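As a rough guide to what BF16 precision means for hardware, the sketch below estimates the memory needed just to hold the weights: BF16 uses 2 bytes per parameter, so 10.2B parameters come to roughly 19 GiB, before activations or the attention cache. The `load_in_bf16` helper shows one plausible way to load the model with Hugging Face `transformers`. The repository id `AIDC-AI/Ovis1.6-Gemma2-9B` and the need for `trust_remote_code=True` are assumptions, not details given on this page.

```python
PARAM_COUNT = 10.2e9   # parameter count listed in the property table
BYTES_PER_BF16 = 2     # bfloat16 stores each value in 16 bits


def weight_memory_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


def load_in_bf16(model_id="AIDC-AI/Ovis1.6-Gemma2-9B"):
    """Load the model in BF16. The default model_id is an assumed repository id.

    Imports are local so the memory estimate above runs without torch installed.
    """
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # keep weights in BF16 rather than FP32
        trust_remote_code=True,      # assumed: the repository ships custom model code
    )


if __name__ == "__main__":
    print(f"BF16 weights: about {weight_memory_gib(PARAM_COUNT):.1f} GiB")
```

The same weights in FP32 (4 bytes per parameter) would take twice as much memory, which is one practical reason to load in BF16.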

Core Capabilities

  • Image-text understanding and generation
  • Multimodal conversation handling
  • High-performance visual reasoning
  • Support for multiple input formats

Frequently Asked Questions

Q: What makes this model unique?

Two things set it apart from other MLLMs: leading OpenCompass benchmark results among open-source MLLMs with up to 30B parameters, achieved with only 10.2B parameters, and its structural embedding alignment approach to multimodal processing.

Q: What are the recommended use cases?

The model is particularly well-suited for tasks requiring image understanding and text generation, including image description, visual question answering, and multimodal conversation scenarios.
