Aquila-VL-2B-llava-qwen

Maintained By
BAAI


Parameter Count: 2.18B
License: Apache 2.0
Paper: Infinity-MM Paper
Languages: English, Chinese
Architecture: LLaVA-OneVision with Qwen2.5-1.5B-instruct LLM

What is Aquila-VL-2B-llava-qwen?

Aquila-VL-2B is a vision-language model that pairs the Qwen2.5-1.5B-instruct language model with a SigLIP vision encoder. Trained on the Infinity-MM dataset of roughly 40 million image-text pairs, it delivers strong multimodal understanding at a compact 2.18B-parameter size.

Implementation Details

The model is built on the LLaVA-OneVision framework, integrating Qwen2.5-1.5B-instruct as the language model and siglip-so400m-patch14-384 as the vision tower. The weights are released in BF16 precision, and the model performs strongly across a range of benchmarks.

  • Trained on 40M image-text pairs from Infinity-MM dataset
  • Supports both English and Chinese languages
  • Achieves state-of-the-art results among open models of similar scale on multiple benchmarks
  • Implements advanced vision-language processing capabilities
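Since the weights are released in BF16, their memory footprint can be estimated directly from the parameter count. A back-of-envelope sketch (weights only; activations, KV cache, and framework overhead add more):

```python
# Estimate the weight memory of a 2.18B-parameter model stored in BF16.
# BF16 uses 16 bits (2 bytes) per parameter. Illustrative figures only.
PARAMS = 2.18e9          # parameter count from the model card
BYTES_PER_PARAM = 2      # bfloat16 = 2 bytes

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_gib = weight_bytes / 1024**3

print(f"~{weight_gib:.2f} GiB of weights")  # ~4.06 GiB
```

This is why the model fits comfortably on a single consumer GPU, with headroom left for the vision tower's activations and generation cache.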

Core Capabilities

  • Visual Question Answering with high accuracy
  • Strong performance in MMBench evaluations (78.8% for English, 76.4% for Chinese)
  • Excels in mathematical vision tasks (59% on MathVista)
  • Superior chart and document understanding capabilities
  • Robust OCR integration and visual reasoning

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its balanced performance across diverse tasks, particularly excelling in mathematical vision and chart understanding while maintaining strong capabilities in general visual question answering. It achieves this with a relatively compact 2.18B parameter size.

Q: What are the recommended use cases?

The model is well-suited for applications requiring visual question answering, document analysis, mathematical problem solving with visual components, and general image understanding tasks in both English and Chinese languages.
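In these use cases, the image is typically supplied alongside a chat-formatted text prompt. A minimal, hypothetical sketch of such a prompt, assuming the standard Qwen chat-template tokens and the conventional LLaVA "<image>" placeholder that preprocessing replaces with vision tokens (the exact template for this checkpoint may differ; verify against its tokenizer config):

```python
# Hypothetical single-image prompt builder in the Qwen chat style.
# <|im_start|>/<|im_end|> are Qwen's chat-template markers; "<image>"
# is the placeholder LLaVA-style preprocessing expands into vision tokens.
def build_prompt(question: str) -> str:
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<image>\n" + question + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("What trend does this chart show?"))
```

The trailing "assistant" turn is left open so generation continues from there; the same template works for Chinese questions, since the tokenizer covers both languages.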
