imp-v1-3b
| Property | Value |
|---|---|
| Parameter Count | 3.19B |
| Model Type | Multimodal Small Language Model |
| License | Apache 2.0 |
| Paper | Read Paper |
| Architecture | Phi-2 (2.7B) + SigLIP Visual Encoder (0.4B) |
What is imp-v1-3b?
imp-v1-3b is a groundbreaking multimodal small language model that combines the power of Microsoft's Phi-2 language model with Google's SigLIP visual encoder. Developed by MILVLG at Hangzhou Dianzi University, this model achieves remarkable performance despite its compact size, matching or exceeding the capabilities of larger 7B parameter models.
Implementation Details
The model leverages a hybrid architecture that integrates a 2.7B-parameter language model (Phi-2) with a 0.4B-parameter visual encoder (SigLIP). Trained on the LLaVA-v1.5 dataset, it processes both text and images in FP16 precision; a brief loading sketch follows the list below.
- Efficient architecture combining language and vision capabilities
- Training based on LLaVA-v1.5 methodology
- Optimized for deployment on mobile devices
- Compatible with modern transformer-based frameworks such as Hugging Face Transformers
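As a rough illustration of the FP16 deployment path, the sketch below loads the model through Hugging Face Transformers. The repository id MILVLG/imp-v1-3b and the reliance on custom remote code are assumptions based on the public release rather than details from this page, so treat it as a minimal sketch, not official usage.

```python
# Minimal loading sketch (assumes the MILVLG/imp-v1-3b checkpoint on the
# Hugging Face Hub ships its multimodal wrapper as custom remote code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MILVLG/imp-v1-3b"  # assumed Hub repository id

# FP16 weights keep the ~3.19B-parameter model within a single consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # vision tower and projector load via custom code
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```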
Core Capabilities
- Achieves 81.42% accuracy on the VQAv2 benchmark
- Outperforms similar-sized models across 9 benchmarks
- Excels in visual question answering tasks
- Supports detailed image-text interactions
- Optimized for resource-constrained environments
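To make the visual question answering workflow concrete, here is a hedged sketch of a single VQA query, continuing from the loading example above. It assumes the checkpoint's remote code exposes an `image_preprocess` helper and an `images` argument to `generate`, and uses a LLaVA-style prompt with an `<image>` placeholder; the released interface may differ in its exact names.

```python
# VQA sketch continuing from the loading example above. The image_preprocess
# helper and the images= argument are assumed to come from the model's custom
# remote code and may differ in the released version.
from PIL import Image

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: <image>\nWhat color is the bus in the image? ASSISTANT:"
)
image = Image.open("bus.jpg")  # any local image file

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
image_tensor = model.image_preprocess(image)  # SigLIP-side preprocessing (assumed helper)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
)[0]
answer = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
print(answer)
```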
Frequently Asked Questions
Q: What makes this model unique?
imp-v1-3b stands out for achieving performance comparable to 7B parameter models while using only 3B parameters, making it ideal for mobile and resource-constrained applications. Its architecture efficiently combines vision and language capabilities in a compact form factor.
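A quick back-of-envelope calculation illustrates the efficiency claim: at FP16 (2 bytes per parameter), the 3.19B-parameter weights occupy roughly 6.4 GB, versus roughly 14 GB for a 7B model, before accounting for activations or the KV cache.

```python
# Rough weight-memory estimate at FP16 (2 bytes per parameter); ignores
# activations, KV cache, and framework overhead.
def fp16_weight_gb(num_params: float) -> float:
    return num_params * 2 / 1e9  # bytes -> GB (decimal)

print(f"imp-v1-3b: ~{fp16_weight_gb(3.19e9):.1f} GB")  # ~6.4 GB
print(f"7B model:  ~{fp16_weight_gb(7.0e9):.1f} GB")   # ~14.0 GB
```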
Q: What are the recommended use cases?
The model is particularly well-suited for visual question answering, image understanding tasks, and multimodal applications where resource efficiency is crucial. It's ideal for mobile devices and robots requiring strong visual-language understanding capabilities.