OmniVLM-968M

Maintained By
NexaAIDev

Property: Value
Model Size: 968M parameters
Type: Vision-Language Model
Author: NexaAIDev
Latest Version: v3 (December 2024)
Repository: Hugging Face

What is OmniVLM-968M?

OmniVLM-968M is a sub-billion-parameter vision-language model designed for efficient deployment on edge devices. Built on LLaVA's architecture, it adds a token compression step that reduces the number of image tokens from 729 to 81 (a 9x reduction), which lowers the computational cost of processing each image while maintaining performance.
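To give a rough sense of what the reduction from 729 to 81 image tokens means for the language model, the sketch below compares total sequence lengths and the number of token pairs a full self-attention layer would compare. The prompt length of 64 text tokens and the helper names are hypothetical, and the quadratic cost model is a simplification rather than a measured benchmark of this model.

```python
def sequence_length(image_tokens: int, text_tokens: int) -> int:
    """Total tokens the language model sees for one image plus one prompt."""
    return image_tokens + text_tokens


def attention_pairs(seq_len: int) -> int:
    """Token pairs compared by one full self-attention layer (quadratic in length)."""
    return seq_len * seq_len


TEXT_TOKENS = 64  # hypothetical prompt length, chosen only for illustration

before = sequence_length(729, TEXT_TOKENS)  # uncompressed image tokens
after = sequence_length(81, TEXT_TOKENS)    # compressed image tokens
ratio = attention_pairs(before) / attention_pairs(after)

print(f"sequence length: {before} -> {after}")
print(f"attention pairs reduced by about {ratio:.1f}x")
```

Under these assumptions the saving on the whole sequence is larger than 9x, because attention cost grows with the square of the sequence length; the exact figure depends on how long the text prompt is.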

Implementation Details

The architecture combines three components: Qwen2.5-0.5B-Instruct as the base language model, SigLIP-400M as the vision encoder operating at 384×384 input resolution, and a custom MLP projection layer that enables the 9x token reduction. Training proceeds in three stages: pretraining, supervised fine-tuning, and Direct Preference Optimization (DPO).

  • 9x Token Reduction Technology
  • DPO-enhanced accuracy and reduced hallucinations
  • Efficient on-device processing (< 2s for 1046×1568 images)
  • Minimal resource requirements (988 MB RAM, 948 MB Storage)
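One way a projection layer could achieve a 9x token reduction is to merge every 9 consecutive image tokens into a single wider vector before the MLP maps it to the language model's embedding size. The sketch below shows only that grouping step in plain Python. The function name `compress_tokens`, the toy hidden size of 4, and the choice to group consecutive tokens by concatenation are illustrative assumptions; the text above states only that a custom MLP projection enables the reduction.

```python
from typing import List

Vector = List[float]


def compress_tokens(tokens: List[Vector], group: int = 9) -> List[Vector]:
    """Concatenate each run of `group` consecutive token vectors into one vector.

    Assumed scheme for illustration: [729, hidden] becomes [81, hidden * 9].
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed: List[Vector] = []
    for start in range(0, len(tokens), group):
        merged: Vector = []
        for vec in tokens[start:start + group]:
            merged.extend(vec)  # widen the feature dimension instead of dropping data
        compressed.append(merged)
    return compressed


HIDDEN = 4  # toy hidden size; a real vision encoder's is far larger
image_tokens = [[float(i)] * HIDDEN for i in range(729)]

compressed = compress_tokens(image_tokens)
print(len(compressed), len(compressed[0]))  # 81 36
```

In this scheme no token is discarded: the sequence gets shorter while each vector gets wider, and a following MLP (not shown) would project the wider vectors into the language model's input space.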

Core Capabilities

  • Visual Question Answering
  • Image Captioning
  • Complex Image Understanding
  • Art Description Generation
  • Accurate Color and Detail Detection

Frequently Asked Questions

Q: What makes this model unique?

OmniVLM-968M stands out for its image token compression (729 to 81 tokens), which enables efficient deployment on edge devices while remaining competitive with larger models. It outperforms previous small-scale VLMs on multiple benchmarks, including ScienceQA, POPE, and MM-VET.

Q: What are the recommended use cases?

The model is ideal for on-device applications requiring visual question answering and image captioning. Its efficient architecture makes it particularly suitable for edge devices where computational resources are limited but real-time performance is needed.
