Omnivision-968M

Parameter Count: 474M
License: Apache-2.0
Model Type: Multimodal Vision-Language Model
Author: NexaAIDev

What is omnivision-968M?

Omnivision-968M is a compact multimodal model designed specifically for edge-device deployment, combining visual and text processing in a single model. Built upon LLaVA's architecture, it achieves its efficiency through an aggressive reduction in image tokens while maintaining strong performance on visual-language tasks.

Implementation Details

The model architecture consists of three primary components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 resolution, and a custom MLP projection layer that achieves a 9x token reduction compared to traditional approaches. The vision encoder splits each image into 14×14-pixel patches, and the projection layer reduces the resulting image tokens from 729 to 81 (see the sketch after the list below).

  • Efficient Token Processing: 9x reduction in image tokens for improved computational efficiency
  • DPO Training: Accuracy enhanced through Direct Preference Optimization
  • Compact Size: Requires only 988 MB of RAM and 948 MB of storage
  • Fast Processing: Sub-2-second inference time on an M4 Pro MacBook
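
For intuition, here is a minimal sketch of how a 9x token reduction of this kind can be implemented: SigLIP-400M at 384×384 with 14×14 patches yields a 27×27 grid of 729 vision tokens, and grouping each 3×3 neighborhood into a single token before the MLP projection leaves a 9×9 grid of 81 tokens. The grouping strategy and hidden sizes below (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are assumptions for illustration, not the published omnivision implementation.

```python
import torch
import torch.nn as nn


class TokenReducingProjector(nn.Module):
    """Illustrative 9x token-reducing MLP projector (assumed design).

    Each 3x3 neighborhood of vision tokens is concatenated into one
    vector and projected into the language model's embedding space,
    turning a 27x27 grid (729 tokens) into a 9x9 grid (81 tokens).
    """

    def __init__(self, vision_dim=1152, lm_dim=896, group=3):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, tokens):
        # tokens: (batch, 729, vision_dim), row-major over a 27x27 grid
        b, n, d = tokens.shape
        side = int(n ** 0.5)                     # 27
        g = self.group                           # 3
        # split the 27x27 grid into 9x9 cells of 3x3 neighbouring tokens
        x = tokens.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)                       # (batch, 81, lm_dim)


# quick shape check: 729 vision tokens in, 81 language-model tokens out
proj = TokenReducingProjector()
print(proj(torch.randn(1, 729, 1152)).shape)     # torch.Size([1, 81, 896])
```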

Core Capabilities

  • Visual Question Answering (VQA)
  • Image Captioning
  • Complex Image Understanding
  • Color and Detail Detection
  • Anime Recognition

Frequently Asked Questions

Q: What makes this model unique?

The model's standout feature is its 9x image-token reduction, which lets it achieve strong performance on visual-language tasks with a significantly reduced computational footprint. This makes it particularly suitable for edge-device deployment while remaining competitive with larger models.

Q: What are the recommended use cases?

The model is ideal for applications requiring on-device visual question answering and image captioning. It performs particularly well when visual information must be processed quickly with minimal computational resources, making it a good fit for mobile applications, edge devices, and other settings where real-time response is crucial.
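
For a sense of the call pattern such an application would use, here is a minimal sketch of a batch VQA loop. `OmnivisionRunner` is a hypothetical stand-in for whatever local runtime hosts the model (for example, the Nexa SDK); it is not a documented API.

```python
from pathlib import Path


class OmnivisionRunner:
    """Hypothetical stand-in for a local omnivision-968M backend."""

    def answer(self, image_path: str, question: str) -> str:
        # A real backend would encode the image (81 tokens after the 9x
        # reduction), append the question, and decode a short text answer.
        return f"[answer about {Path(image_path).name}]"


def ask_folder(runner: OmnivisionRunner, folder: str, question: str) -> dict:
    """Ask the same question about every JPEG in a folder."""
    return {
        p.name: runner.answer(str(p), question)
        for p in sorted(Path(folder).glob("*.jpg"))
    }


if __name__ == "__main__":
    answers = ask_folder(OmnivisionRunner(), "./photos", "What is the dominant color?")
    for name, text in answers.items():
        print(f"{name}: {text}")
```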
