OmniVLM-968M
| Property | Value |
|---|---|
| Model Size | 968M parameters |
| Type | Vision-Language Model |
| Author | NexaAIDev |
| Latest Version | v3 (December 2024) |
| Repository | Hugging Face |
What is OmniVLM-968M?
OmniVLM-968M is a sub-billion-parameter vision-language model designed for efficient deployment on edge devices. Built on a LLaVA-style architecture, it introduces a token-compression scheme that cuts the number of image tokens from 729 to 81, substantially reducing compute while maintaining strong performance.
Implementation Details
The architecture combines three key components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 input resolution, and a custom MLP projection layer that performs the 9x token reduction. Training follows a three-stage pipeline: pretraining, supervised fine-tuning, and Direct Preference Optimization (DPO).
- 9x Token Reduction Technology
- DPO-enhanced accuracy and reduced hallucinations
- Efficient on-device processing (< 2s for 1046×1568 images)
- Minimal resource requirements (988 MB RAM, 948 MB storage)
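To make the 9x reduction concrete, the following is a minimal PyTorch sketch of the idea: the 729 patch embeddings produced by the vision encoder (a 27×27 grid at 384×384 resolution) are grouped into 3×3 neighborhoods and projected by an MLP into 81 tokens for the language model. The class name, layer choices, and hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class TokenCompressingProjector(nn.Module):
    """Illustrative projector: 729 vision tokens (27x27 grid) -> 81 tokens (9x9 grid).

    Hypothetical sketch only; dimensions and layer choices are assumptions,
    not OmniVLM-968M's actual implementation.
    """

    def __init__(self, vision_dim: int = 1152, text_dim: int = 896, group: int = 3):
        super().__init__()
        self.group = group
        # Each output token is projected from a group x group neighborhood of patches.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        b, n, d = patches.shape                 # e.g. (batch, 729, vision_dim)
        side = int(n ** 0.5)                    # 27 for a 27x27 patch grid
        g = self.group
        x = patches.view(b, side, side, d)
        # Regroup into (side/g x side/g) cells, each holding g*g patch embeddings.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)                      # (batch, 81, text_dim)

# Quick shape check with dummy SigLIP-like features.
proj = TokenCompressingProjector()
dummy = torch.randn(1, 729, 1152)
print(proj(dummy).shape)  # torch.Size([1, 81, 896])
```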
Core Capabilities
- Visual Question Answering
- Image Captioning
- Complex Image Understanding
- Art Description Generation
- Accurate Color and Detail Detection
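Because the model targets on-device use, a common way to exercise these capabilities is a GGUF export run through llama.cpp-compatible tooling. The sketch below uses llama-cpp-python's LLaVA-style chat handler purely as an illustration; the file names are placeholders, and compatibility of the stock handler with OmniVLM's modified projector is an assumption rather than something the model card confirms.

```python
# Hypothetical sketch: VQA / captioning with llama-cpp-python.
# File paths are placeholders; handler compatibility with OmniVLM's
# compressed-token projector is an assumption.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="projector.gguf")  # vision projector
llm = Llama(
    model_path="omnivlm-968m.gguf",  # quantized language-model weights
    chat_handler=chat_handler,
    n_ctx=2048,                      # room for 81 image tokens plus prompt and answer
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/photo.jpg"}},
                {"type": "text", "text": "What colors are the objects in this image?"},
            ],
        }
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```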
Frequently Asked Questions
Q: What makes this model unique?
OmniVLM-968M stands out for its token-compression design, which enables efficient edge-device deployment while remaining competitive with larger models. It outperforms previous small-scale VLMs across multiple benchmarks, including ScienceQA, POPE, and MM-VET.
Q: What are the recommended use cases?
The model is ideal for on-device applications requiring visual question answering and image captioning. Its efficient architecture makes it particularly suitable for edge devices where computational resources are limited but real-time performance is needed.