Aquila-VL-2B-llava-qwen
| Property | Value |
|---|---|
| Parameter Count | 2.18B |
| License | Apache 2.0 |
| Paper | Infinity-MM Paper |
| Languages | English, Chinese |
| Architecture | LLaVA-OneVision with Qwen2.5-1.5B-Instruct LLM |
What is Aquila-VL-2B-llava-qwen?
Aquila-VL-2B is a vision-language model that pairs the Qwen2.5-1.5B-Instruct language model with a SigLIP vision encoder. Trained on the Infinity-MM dataset of 40 million image-text pairs, it targets strong multimodal understanding at a compact 2.18B-parameter scale.
Implementation Details
The model follows the LLaVA-OneVision framework, pairing Qwen2.5-1.5B-Instruct as the language model with siglip-so400m-patch14-384 as the vision tower. It is released in BF16 precision and performs competitively across a range of multimodal benchmarks; a minimal loading sketch follows the list below.
- Trained on 40M image-text pairs from the Infinity-MM dataset
- Supports both English and Chinese
- Reports state-of-the-art results among similarly sized open models on multiple benchmarks
- Handles general vision-language tasks such as visual question answering, OCR, and chart/document understanding
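As a rough illustration of how such a checkpoint can be loaded in BF16, the sketch below uses the Hugging Face transformers LLaVA-OneVision classes. The repository id `BAAI/Aquila-VL-2B-llava-qwen` and transformers compatibility are assumptions on my part; the `-llava-qwen` suffix may indicate the weights are instead intended for the original LLaVA-NeXT codebase.

```python
# Minimal loading sketch (assumptions: hub id and transformers-format
# LLaVA-OneVision weights; adjust if the checkpoint targets the LLaVA codebase).
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

MODEL_ID = "BAAI/Aquila-VL-2B-llava-qwen"  # assumed Hugging Face repo id

# Load in BF16, matching the precision the card describes.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```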
Core Capabilities
- Visual question answering with high accuracy (see the inference sketch after this list)
- Strong performance on MMBench (78.8% English, 76.4% Chinese)
- Excels at mathematical vision tasks (59% on MathVista)
- Strong chart and document understanding
- Robust OCR and visual reasoning
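Continuing from the loading sketch above, the snippet below shows how a visual question answering call might look with the transformers chat template. The image URL and question are hypothetical, and compatibility of this exact checkpoint with the transformers processor remains an assumption.

```python
# Continues from the loading sketch: reuses `model` and `processor`.
import torch
import requests
from PIL import Image

# Hypothetical example image and question.
url = "https://example.com/sample_chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt with one image placeholder and one text turn.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```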
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its balanced performance across diverse tasks, particularly excelling in mathematical vision and chart understanding while maintaining strong capabilities in general visual question answering. It achieves this with a relatively compact 2.18B parameter size.
Q: What are the recommended use cases?
The model is well-suited for applications requiring visual question answering, document analysis, mathematical problem solving with visual components, and general image understanding in both English and Chinese.