imp-v1-3b
| Property | Value |
|---|---|
| Parameter Count | 3.19B |
| Model Type | Multimodal Small Language Model |
| License | Apache 2.0 |
| Paper | Read Paper |
| Architecture | Phi-2 (2.7B) + SigLIP Visual Encoder (0.4B) |
What is imp-v1-3b?
imp-v1-3b is a groundbreaking multimodal small language model that combines the power of Microsoft's Phi-2 language model with Google's SigLIP visual encoder. Developed by MILVLG at Hangzhou Dianzi University, this model achieves remarkable performance despite its compact size, matching or exceeding the capabilities of larger 7B parameter models.
Implementation Details
The model leverages a hybrid architecture that integrates a 2.7B-parameter language model (Phi-2) with a 0.4B-parameter visual encoder (SigLIP). Trained on the LLaVA-v1.5 dataset, it processes both text and images in FP16 precision; a brief loading sketch follows the list below.
- Efficient architecture combining language and vision capabilities
- Training based on LLaVA-v1.5 methodology
- Optimized for deployment on mobile devices
- Compatible with modern transformer-based frameworks such as Hugging Face Transformers
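As a rough illustration of the FP16 deployment path, the sketch below loads the model through Hugging Face Transformers. The repository id MILVLG/imp-v1-3b and the reliance on custom remote code are assumptions based on the public release rather than details from this page, so treat it as a minimal sketch, not official usage.

```python
# Minimal loading sketch (assumes the MILVLG/imp-v1-3b checkpoint on the
# Hugging Face Hub ships its multimodal wrapper as custom remote code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MILVLG/imp-v1-3b"  # assumed Hub repository id

# FP16 weights keep the ~3.19B-parameter model within a single consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # vision tower and projector load via custom code
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```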
Core Capabilities
- Achieves 81.42% accuracy on the VQAv2 benchmark
- Outperforms similar-sized models across 9 benchmarks
- Excels in visual question answering tasks
- Supports detailed image-text interactions
- Optimized for resource-constrained environments
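To make the visual question answering workflow concrete, here is a hedged sketch of a single VQA query, continuing from the loading example above. It assumes the checkpoint's remote code exposes an `image_preprocess` helper and an `images` argument to `generate`, and uses a LLaVA-style prompt with an `<image>` placeholder; the released interface may differ in its exact names.

```python
# VQA sketch continuing from the loading example above. The image_preprocess
# helper and the images= argument are assumed to come from the model's custom
# remote code and may differ in the released version.
from PIL import Image

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: <image>\nWhat color is the bus in the image? ASSISTANT:"
)
image = Image.open("bus.jpg")  # any local image file

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
image_tensor = model.image_preprocess(image)  # SigLIP-side preprocessing (assumed helper)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
)[0]
answer = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
print(answer)
```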
Frequently Asked Questions
Q: What makes this model unique?
imp-v1-3b stands out for achieving performance comparable to 7B parameter models while using only 3B parameters, making it ideal for mobile and resource-constrained applications. Its architecture efficiently combines vision and language capabilities in a compact form factor.
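A quick back-of-envelope calculation illustrates the efficiency claim: at FP16 (2 bytes per parameter), the 3.19B-parameter weights occupy roughly 6.4 GB, versus roughly 14 GB for a 7B model, before accounting for activations or the KV cache.

```python
# Rough weight-memory estimate at FP16 (2 bytes per parameter); ignores
# activations, KV cache, and framework overhead.
def fp16_weight_gb(num_params: float) -> float:
    return num_params * 2 / 1e9  # bytes -> GB (decimal)

print(f"imp-v1-3b: ~{fp16_weight_gb(3.19e9):.1f} GB")  # ~6.4 GB
print(f"7B model:  ~{fp16_weight_gb(7.0e9):.1f} GB")   # ~14.0 GB
```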
Q: What are the recommended use cases?
The model is particularly well-suited for visual question answering, image understanding tasks, and multimodal applications where resource efficiency is crucial. It's ideal for mobile devices and robots requiring strong visual-language understanding capabilities.