OmniVLM-968M

Property	Value
Model Size	968M parameters
Type	Vision-Language Model
Author	NexaAIDev
Latest Version	v3 (December 2024)
Repository	Hugging Face

What is OmniVLM-968M?

OmniVLM-968M is a sub-billion-parameter vision-language model designed for efficient deployment on edge devices. Built on LLaVA's architecture, it adds a token compression step that reduces the number of image tokens from 729 to 81 (a 9x reduction), which lowers the compute cost per image while maintaining competitive performance.
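To make the efficiency claim concrete, the sketch below estimates how the image token count affects self-attention cost. It is back-of-the-envelope arithmetic, not a measurement of the model: it assumes attention cost grows with the square of the sequence length, and the prompt length TEXT_TOKENS is a hypothetical value chosen for illustration.

```python
def attention_pairs(image_tokens: int, text_tokens: int) -> int:
    """Number of token pairs scored by full self-attention over one sequence."""
    sequence_length = image_tokens + text_tokens
    return sequence_length * sequence_length


# Hypothetical prompt length; not a figure from the model card.
TEXT_TOKENS = 32

before = attention_pairs(729, TEXT_TOKENS)  # uncompressed image tokens
after = attention_pairs(81, TEXT_TOKENS)    # after the 9x reduction

print(f"pairs before: {before}")
print(f"pairs after:  {after}")
print(f"approximate reduction: {before / after:.1f}x")
```

The ratio depends on how long the text prompt is: with no text at all, the reduction in attention pairs would be 9 × 9 = 81x, and it shrinks as the prompt grows. Costs that scale linearly with sequence length, such as the feed-forward layers, would drop by at most 9x.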

Implementation Details

The model architecture combines three key components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 input resolution, and a custom MLP projection layer that performs the 9x token reduction (729 image tokens down to 81). The model is trained in three stages: pretraining, supervised fine-tuning, and Direct Preference Optimization (DPO).

9x token reduction (729 to 81 image tokens)
DPO-enhanced accuracy and reduced hallucinations
Efficient on-device processing (< 2s for 1046×1568 images)
Minimal resource requirements (988 MB RAM, 948 MB Storage)
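The 9x token reduction can be pictured as a reshape ahead of the projection layer. The following is a minimal sketch under stated assumptions, not the model's actual code: it assumes that every 9 consecutive image tokens are concatenated into one wider vector (so 729 tokens of width hidden_size become 81 tokens of width 9 × hidden_size) before the MLP maps them into the language model's embedding space. The function name compress_tokens and the toy hidden_size are hypothetical.

```python
def compress_tokens(tokens, factor=9):
    """Concatenate every `factor` consecutive token vectors into one wider vector.

    Assumed grouping scheme for illustration only; the source states the
    729 -> 81 reduction but not how tokens are grouped.
    """
    if len(tokens) % factor != 0:
        raise ValueError("token count must be divisible by the reduction factor")

    compressed = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        # Flatten the group: `factor` vectors of width d become one of width factor * d.
        merged = [value for vector in group for value in vector]
        compressed.append(merged)
    return compressed


# Toy embedding width; the real encoder's hidden size is much larger.
hidden_size = 4
tokens = [[float(i)] * hidden_size for i in range(729)]

compressed = compress_tokens(tokens)
print(len(tokens), "tokens of width", len(tokens[0]))
print(len(compressed), "tokens of width", len(compressed[0]))
```

No information is discarded by the reshape itself: the sequence gets 9x shorter while each token gets 9x wider, and the projection layer is then responsible for mapping the wider vectors to the language model's input size.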

Core Capabilities

Visual Question Answering
Image Captioning
Complex Image Understanding
Art Description Generation
Accurate Color and Detail Detection

Frequently Asked Questions

Q: What makes this model unique?

OmniVLM-968M stands out for its token compression, which enables efficient deployment on edge devices while remaining competitive with larger models. It outperforms previous small-scale VLMs across multiple benchmarks, including ScienceQA, POPE, and MM-VET.

Q: What are the recommended use cases?

The model is ideal for on-device applications requiring visual question answering and image captioning. Its efficient architecture makes it particularly suitable for edge devices where computational resources are limited but real-time performance is needed.
