# Omnivision-968M
| Property | Value |
|---|---|
| Parameter Count | 474M |
| License | Apache-2.0 |
| Model Type | Multimodal Vision-Language Model |
| Author | NexaAIDev |
## What is omnivision-968M?
Omnivision-968M is a compact multimodal model designed specifically for edge-device deployment, combining visual and text processing in a single model. Built on LLaVA's architecture, it achieves its efficiency through a 9x reduction in image tokens while maintaining high performance on visual-language tasks.
## Implementation Details
The model architecture consists of three primary components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 resolution, and a custom MLP projection layer that achieves a 9x token reduction compared to traditional approaches. The model processes images at a 14×14 patch size, and the projector reduces the image tokens from 729 (a 27×27 patch grid) to 81 (9×9).
- Efficient Token Processing: 9x reduction in image tokens, improving computational efficiency
- DPO Training: Enhanced accuracy through Direct Preference Optimization
- Compact Size: Requires only 988 MB of RAM and 948 MB of storage
- Fast Processing: Sub-2-second inference on an M4 Pro MacBook
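One way to picture the 9x reduction is to merge each 3×3 neighborhood of the 27×27 patch grid into a single token before projecting into the language model's embedding space. The sketch below illustrates that idea in PyTorch; the `TokenReducingProjector` class, the reshape-based merging strategy, and the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class TokenReducingProjector(nn.Module):
    """Hypothetical sketch of a 9x token-reducing MLP projector.

    Assumes the reduction folds each 3x3 neighborhood of the 27x27
    SigLIP patch grid (729 tokens) into one token (81 tokens) by
    concatenating the 9 patch embeddings along the channel axis,
    then projecting into the LLM embedding space. Dimensions are
    illustrative; Omnivision's actual projector may differ.
    """

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * 9, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 729, vision_dim) -> (batch, 27, 27, vision_dim)
        b, n, d = x.shape
        g = int(n ** 0.5)  # 27
        x = x.view(b, g, g, d)
        # Group the grid into 3x3 blocks: (batch, 9, 3, 9, 3, d)
        x = x.view(b, g // 3, 3, g // 3, 3, d)
        # Gather each block's 9 embeddings into the channel axis:
        # (batch, 9, 9, 3, 3, d) -> (batch, 81, 9 * d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (g // 3) ** 2, 9 * d)
        # Project the 81 merged tokens into the LLM embedding space
        return self.mlp(x)


tokens = torch.randn(1, 729, 1152)  # SigLIP-400M output at 384px, patch 14
print(TokenReducingProjector()(tokens).shape)  # torch.Size([1, 81, 896])
```

Trading sequence length for channel width this way is a common pattern for compressing vision tokens; whatever the exact mechanism, the effect described in the card is the same 729 → 81 token count.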
## Core Capabilities
- Visual Question Answering (VQA)
- Image Captioning
- Complex Image Understanding
- Color and Detail Detection
- Anime Recognition
## Frequently Asked Questions
### Q: What makes this model unique?
The model's standout feature is its 9x token-reduction mechanism, which achieves high performance on visual-language tasks with a significantly smaller computational footprint. This makes it well suited to edge-device deployment while remaining competitive with larger models.
### Q: What are the recommended use cases?
The model is ideal for applications requiring on-device visual question answering and image captioning. It performs particularly well when visual information must be processed quickly with minimal computational resources, making it a strong fit for mobile applications, edge devices, and situations where real-time response is crucial.