llava-onevision-qwen2-7b-ov

lmms-lab

A powerful 8.03B-parameter multimodal model that processes both images and videos, achieving 80.8% accuracy on MMBench and strong performance across 30+ benchmarks.

| Property | Value |
| --- | --- |
| Parameter Count | 8.03B |
| License | Apache 2.0 |
| Languages | English, Chinese |
| Paper | LLaVA-OneVision Paper |
| Training Data | LLaVA-OneVision Dataset |

What is llava-onevision-qwen2-7b-ov?

LLaVA-OneVision is a state-of-the-art multimodal model built on the Qwen2 architecture, designed to process and understand both images and videos. With 8.03B parameters and trained using bfloat16 precision, it represents a significant advancement in visual-language understanding, achieving impressive performance across multiple benchmarks.

Implementation Details

The model pairs a SigLIP SO400M vision encoder with a Qwen2 language backbone and is trained in four stages: LCS-558K pretraining, a mid-stage on 4.7M synthetic data, a final-image stage on 3.6M single-image data, and a OneVision stage on 1.6M mixed-media (single-image, multi-image, and video) data.

  • Context window of 32K tokens
  • Trained on 256 NVIDIA A100 GPUs
  • Built with the Hugging Face Trainer and PyTorch
  • Supports both image and video processing
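The Hugging Face ports of LLaVA-OneVision accept prompts built from a structured conversation with image placeholders. The sketch below shows that message format as plain Python data; `build_conversation` is a hypothetical helper, and the exact template is defined by the model's processor configuration, so treat this as an illustration rather than the official API.

```python
# Sketch of the multimodal chat-message structure commonly used with
# LLaVA-OneVision processors on the Hugging Face Hub. The helper name is
# hypothetical; the dict layout mirrors the processors' chat-template input.

def build_conversation(question: str, num_images: int = 1) -> list[dict]:
    """Build a single-turn user message with image placeholders plus text."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Two images followed by one question in a single user turn.
conv = build_conversation("What does the chart show?", num_images=2)
```

A processor's `apply_chat_template` would then turn this structure into the actual prompt string with the correct image tokens inserted.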

Core Capabilities

  • 90.2% accuracy on DocVQA benchmark
  • 80.8% accuracy on MMBench
  • 96.0% accuracy on ScienceQA
  • Effective processing of multi-image and video inputs
  • Bilingual support for English and Chinese
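The 32K context window is what bounds how much visual input fits in one request. A back-of-envelope sketch, assuming roughly 196 visual tokens per pooled video frame and a reserved text budget (both numbers are assumptions for illustration; actual counts vary with resolution and processor settings):

```python
# Rough token-budget arithmetic for video input under a 32K context window.
# TOKENS_PER_FRAME and RESERVED_TEXT_TOKENS are illustrative assumptions.

CONTEXT_WINDOW = 32_000        # model context length in tokens
TOKENS_PER_FRAME = 196         # assumed visual tokens per pooled video frame
RESERVED_TEXT_TOKENS = 2_000   # assumed budget for prompt + generated answer

def max_video_frames(context: int = CONTEXT_WINDOW,
                     per_frame: int = TOKENS_PER_FRAME,
                     reserved: int = RESERVED_TEXT_TOKENS) -> int:
    """How many frames fit before visual tokens exhaust the context."""
    return (context - reserved) // per_frame

print(max_video_frames())  # 153 frames under these assumptions
```

Under these assumptions a bit over 150 frames fit, which is why frame sampling rate matters more than clip length when feeding long videos to the model.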

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its multi-stage training curriculum and its ability to handle visual inputs ranging from single images and multi-image sets to videos, while maintaining high performance across diverse benchmarks.

Q: What are the recommended use cases?

The model excels in document analysis, scientific question answering, chart interpretation, and general visual-language tasks, making it suitable for educational, research, and commercial applications requiring sophisticated visual understanding.
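For document analysis and similar tasks, inference typically goes through the community `llava-hf` port on the Hugging Face Hub. A minimal sketch, assuming the model id `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `transformers` >= 4.45, `torch`, and Pillow (verify the current id on the Hub; the helper and file names here are hypothetical):

```python
# Minimal single-image QA sketch against an assumed Hugging Face port of
# LLaVA-OneVision. Heavy imports are deferred so defining the helper is cheap.

MODEL_ID = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed Hub id

def answer_about_image(image, question: str, max_new_tokens: int = 128) -> str:
    """Ask one question about one PIL image and return the decoded answer."""
    import torch
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    conversation = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": question}],
    }]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    from PIL import Image
    img = Image.open("invoice.png")  # hypothetical input document
    print(answer_about_image(img, "What is the invoice total?"))
```

The same helper handles chart interpretation and scientific QA; only the question changes, since the processor builds the correct prompt from the conversation structure.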
