imp-v1-3b

MILVLG

A compact 3B-parameter multimodal LLM that combines Microsoft's Phi-2 with a SigLIP visual encoder, achieving performance comparable to 7B models on visual tasks.

Property           Value
-----------------  -------------------------------------------
Parameter Count    3.19B
Model Type         Multimodal Small Language Model
License            Apache 2.0
Architecture       Phi-2 (2.7B) + SigLIP Visual Encoder (0.4B)

What is imp-v1-3b?

imp-v1-3b is a groundbreaking multimodal small language model that combines the power of Microsoft's Phi-2 language model with Google's SigLIP visual encoder. Developed by MILVLG at Hangzhou Dianzi University, this model achieves remarkable performance despite its compact size, matching or exceeding the capabilities of larger 7B parameter models.

Implementation Details

The model leverages a hybrid architecture that integrates a 2.7B-parameter language model (Phi-2) with a 0.4B-parameter visual encoder (SigLIP). Trained on the LLaVA-v1.5 dataset, it processes both text and images and runs inference efficiently in FP16 precision.

  • Efficient architecture combining language and vision capabilities
  • Training based on LLaVA-v1.5 methodology
  • Optimized for deployment on mobile devices
  • Compatible with modern transformer-based frameworks
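Because the model is compatible with transformer-based frameworks, it can be queried through the standard Hugging Face workflow. The sketch below is a minimal, non-authoritative example: it assumes the hub id `MILVLG/imp-v1-3b`, a LLaVA-v1.5-style prompt template, and an `image_preprocess` helper exposed by the repository's custom `trust_remote_code` modeling files — verify these against the official model card before use.

```python
def build_vqa_prompt(question: str) -> str:
    """Build a LLaVA-v1.5-style conversation prompt with an <image> placeholder.

    The system preamble and the USER/ASSISTANT turn format are assumptions
    based on LLaVA-v1.5 conventions, which imp-v1-3b's training follows.
    """
    return (
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's "
        f"questions. USER: <image>\n{question} ASSISTANT:"
    )


def run_example(image_path: str, question: str) -> str:
    """Sketch of FP16 inference; requires a GPU, network access, and the
    transformers, torch, and Pillow packages. Not called at import time."""
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "MILVLG/imp-v1-3b"  # assumed Hugging Face hub id
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # the card states FP16 inference
        device_map="auto",
        trust_remote_code=True,     # Imp ships custom modeling code
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_vqa_prompt(question)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # image_preprocess is a helper assumed to come with the custom model code.
    image_tensor = model.image_preprocess(Image.open(image_path))
    output_ids = model.generate(
        input_ids, images=image_tensor, max_new_tokens=100, use_cache=True
    )[0]
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(
        output_ids[input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
```

Calling `run_example("example.jpg", "What is in this image?")` would then return the model's answer as a string; the prompt builder can be reused unchanged for batch evaluation on VQA-style datasets.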

Core Capabilities

  • Achieves 81.42% accuracy on VQAv2 benchmark
  • Outperforms similar-sized models across 9 benchmarks
  • Excels in visual question answering tasks
  • Supports detailed image-text interactions
  • Optimized for resource-constrained environments

Frequently Asked Questions

Q: What makes this model unique?

imp-v1-3b stands out for achieving performance comparable to 7B parameter models while using only 3B parameters, making it ideal for mobile and resource-constrained applications. Its architecture efficiently combines vision and language capabilities in a compact form factor.

Q: What are the recommended use cases?

The model is particularly well suited for visual question answering, image understanding, and multimodal applications where resource efficiency is crucial. It is a strong fit for mobile devices and robotics platforms that require capable visual-language understanding.
