Aquila-VL-2B-llava-qwen
| Property | Value |
|---|---|
| Parameter Count | 2.18B |
| License | Apache 2.0 |
| Paper | Infinity-MM Paper |
| Languages | English, Chinese |
| Architecture | LLaVA-OneVision with Qwen2.5-1.5B-Instruct LLM |
What is Aquila-VL-2B-llava-qwen?
Aquila-VL-2B is a vision-language model that pairs the Qwen2.5-1.5B-Instruct language model with a SigLIP vision encoder. Trained on the Infinity-MM dataset of 40 million image-text pairs, it targets strong multimodal understanding at a compact 2.18B-parameter scale.
Implementation Details
The model follows the LLaVA-OneVision framework, pairing Qwen2.5-1.5B-Instruct as the language model with siglip-so400m-patch14-384 as the vision tower. It is released in BF16 precision and performs competitively across a range of multimodal benchmarks; a minimal loading sketch follows the list below.
- Trained on 40M image-text pairs from the Infinity-MM dataset
- Supports both English and Chinese
- Reports state-of-the-art results among similarly sized open models on multiple benchmarks
- Handles general vision-language tasks such as visual question answering, OCR, and chart/document understanding
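As a rough illustration of how such a checkpoint can be loaded in BF16, the sketch below uses the Hugging Face transformers LLaVA-OneVision classes. The repository id `BAAI/Aquila-VL-2B-llava-qwen` and transformers compatibility are assumptions on my part; the `-llava-qwen` suffix may indicate the weights are instead intended for the original LLaVA-NeXT codebase.

```python
# Minimal loading sketch (assumptions: hub id and transformers-format
# LLaVA-OneVision weights; adjust if the checkpoint targets the LLaVA codebase).
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

MODEL_ID = "BAAI/Aquila-VL-2B-llava-qwen"  # assumed Hugging Face repo id

# Load in BF16, matching the precision the card describes.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```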
Core Capabilities
- Visual question answering with high accuracy (see the inference sketch after this list)
- Strong performance on MMBench (78.8% English, 76.4% Chinese)
- Excels at mathematical vision tasks (59% on MathVista)
- Strong chart and document understanding
- Robust OCR and visual reasoning
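Continuing from the loading sketch above, the snippet below shows how a visual question answering call might look with the transformers chat template. The image URL and question are hypothetical, and compatibility of this exact checkpoint with the transformers processor remains an assumption.

```python
# Continues from the loading sketch: reuses `model` and `processor`.
import torch
import requests
from PIL import Image

# Hypothetical example image and question.
url = "https://example.com/sample_chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt with one image placeholder and one text turn.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```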
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its balanced performance across diverse tasks, particularly excelling in mathematical vision and chart understanding while maintaining strong capabilities in general visual question answering. It achieves this with a relatively compact 2.18B parameter size.
Q: What are the recommended use cases?
The model is well-suited for applications requiring visual question answering, document analysis, mathematical problem solving with visual components, and general image understanding in both English and Chinese.