# Omnivision-968M
| Property | Value |
|---|---|
| Parameter Count | 474M |
| License | Apache-2.0 |
| Model Type | Multimodal Vision-Language Model |
| Author | NexaAIDev |
## What is omnivision-968M?
Omnivision-968M is a compact multimodal model designed specifically for edge-device deployment, combining visual and text processing in a single model. Built on LLaVA's architecture, it achieves its efficiency through a 9x reduction in image tokens while maintaining high performance on visual-language tasks.
## Implementation Details
The model architecture consists of three primary components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 resolution, and a custom MLP projection layer that achieves a 9x token reduction compared to traditional approaches. The model processes images at a 14×14 patch size, and the projector reduces the image tokens from 729 (a 27×27 patch grid) to 81 (9×9).
- Efficient Token Processing: 9x reduction in image tokens, improving computational efficiency
- DPO Training: Enhanced accuracy through Direct Preference Optimization
- Compact Size: Requires only 988 MB of RAM and 948 MB of storage
- Fast Processing: Sub-2-second inference on an M4 Pro MacBook
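One way to picture the 9x reduction is to merge each 3×3 neighborhood of the 27×27 patch grid into a single token before projecting into the language model's embedding space. The sketch below illustrates that idea in PyTorch; the `TokenReducingProjector` class, the reshape-based merging strategy, and the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class TokenReducingProjector(nn.Module):
    """Hypothetical sketch of a 9x token-reducing MLP projector.

    Assumes the reduction folds each 3x3 neighborhood of the 27x27
    SigLIP patch grid (729 tokens) into one token (81 tokens) by
    concatenating the 9 patch embeddings along the channel axis,
    then projecting into the LLM embedding space. Dimensions are
    illustrative; Omnivision's actual projector may differ.
    """

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * 9, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 729, vision_dim) -> (batch, 27, 27, vision_dim)
        b, n, d = x.shape
        g = int(n ** 0.5)  # 27
        x = x.view(b, g, g, d)
        # Group the grid into 3x3 blocks: (batch, 9, 3, 9, 3, d)
        x = x.view(b, g // 3, 3, g // 3, 3, d)
        # Gather each block's 9 embeddings into the channel axis:
        # (batch, 9, 9, 3, 3, d) -> (batch, 81, 9 * d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (g // 3) ** 2, 9 * d)
        # Project the 81 merged tokens into the LLM embedding space
        return self.mlp(x)


tokens = torch.randn(1, 729, 1152)  # SigLIP-400M output at 384px, patch 14
print(TokenReducingProjector()(tokens).shape)  # torch.Size([1, 81, 896])
```

Trading sequence length for channel width this way is a common pattern for compressing vision tokens; whatever the exact mechanism, the effect described in the card is the same 729 → 81 token count.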
## Core Capabilities
- Visual Question Answering (VQA)
- Image Captioning
- Complex Image Understanding
- Color and Detail Detection
- Anime Recognition
## Frequently Asked Questions
### Q: What makes this model unique?
The model's standout feature is its 9x token-reduction mechanism, which achieves high performance on visual-language tasks with a significantly smaller computational footprint. This makes it well suited to edge-device deployment while remaining competitive with larger models.
### Q: What are the recommended use cases?
The model is ideal for applications requiring on-device visual question answering and image captioning. It performs particularly well when visual information must be processed quickly with minimal computational resources, making it a strong fit for mobile applications, edge devices, and situations where real-time response is crucial.