Aquila-VL-2B-llava-qwen

Aquila-VL-2B-llava-qwen

BAAI

A 2.18B parameter vision-language model combining Qwen2.5-1.5B-instruct LLM with SigLIP vision architecture, trained on 40M image-text pairs for multimodal understanding.

PropertyValue
Parameter Count2.18B
LicenseApache 2.0
PaperInfinity-MM Paper
LanguagesEnglish, Chinese
ArchitectureLLaVA-one-vision with Qwen2.5-1.5B-instruct LLM

What is Aquila-VL-2B-llava-qwen?

Aquila-VL-2B is an advanced vision-language model that combines the powerful Qwen2.5-1.5B-instruct language model with the SigLIP vision architecture. Trained on the extensive Infinity-MM dataset containing 40 million image-text pairs, it represents a significant advancement in multimodal AI understanding.

Implementation Details

The model utilizes the LLaVA-one-vision framework, integrating Qwen2.5-1.5B-instruct as the language model and siglip-so400m-patch14-384 as the vision tower. It's implemented in BF16 precision and shows impressive performance across various benchmarks.

  • Trained on 40M image-text pairs from Infinity-MM dataset
  • Supports both English and Chinese languages
  • Achieves state-of-the-art performance on multiple benchmarks
  • Implements advanced vision-language processing capabilities

Core Capabilities

  • Visual Question Answering with high accuracy
  • Strong performance in MMBench evaluations (78.8% for English, 76.4% for Chinese)
  • Excels in mathematical vision tasks (59% on MathVista)
  • Superior chart and document understanding capabilities
  • Robust OCR integration and visual reasoning

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its balanced performance across diverse tasks, particularly excelling in mathematical vision and chart understanding while maintaining strong capabilities in general visual question answering. It achieves this with a relatively compact 2.18B parameter size.

Q: What are the recommended use cases?

The model is well-suited for applications requiring visual question answering, document analysis, mathematical problem solving with visual components, and general image understanding tasks in both English and Chinese languages.

Socials
PromptLayer
Company
All services online
Location IconPromptLayer is located in the heart of New York City
PromptLayer © 2026