Mini-InternVL-Chat-4B-V1-5
| Property | Value |
|---|---|
| Parameter Count | 4.15B |
| Model Type | Multimodal LLM |
| Architecture | InternViT-300M + MLP + Phi-3-mini |
| License | MIT |
| Paper | arXiv:2404.16821 |
What is Mini-InternVL-Chat-4B-V1-5?
Mini-InternVL-Chat-4B-V1-5 is a compact multimodal language model that pairs a vision encoder with a language model. Its small footprint is aimed at making multimodal AI more accessible: sophisticated visual-language tasks become practical on consumer-grade hardware such as a single 1080Ti GPU.
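As a quick orientation, the snippet below shows one plausible way to load the model with the Hugging Face transformers library. The repository ID `OpenGVLab/Mini-InternVL-Chat-4B-V1-5` and the dtype choices are assumptions based on the public release rather than a verified recipe for every environment; `trust_remote_code=True` is needed because the chat interface ships as custom modeling code.

```python
# Hedged loading sketch: repo ID and settings are assumptions based on the
# public Hugging Face release, not a guaranteed recipe for every setup.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-4B-V1-5"  # assumed Hugging Face repo ID

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # use torch.float16 on pre-Ampere GPUs such as the 1080Ti
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # loads the custom InternVL chat/vision code
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```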
Implementation Details
The model integrates a distilled InternViT-300M vision encoder with the Phi-3-mini-128k-instruct language model, connected through an MLP projector. It can process images up to 4K resolution by dynamically splitting them into 448x448-pixel tiles (a simplified tiling sketch follows the list below).
- Dynamic resolution support up to 40 tiles of 448x448 pixels
- 8K context length during training
- Support for BF16/FP16 precision and 4/8-bit quantization
- Multi-GPU deployment capabilities
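To make the tiling behaviour concrete, here is a simplified sketch of the idea rather than the exact preprocessing shipped with the model: pick a grid of 448x448 tiles whose aspect ratio best matches the input image, cap the grid at 40 tiles, and optionally append a global thumbnail tile. The function name `dynamic_tile` and its defaults are illustrative assumptions.

```python
# Simplified sketch of dynamic tiling (not the model's exact preprocessing).
from PIL import Image

TILE = 448

def dynamic_tile(image: Image.Image, max_num: int = 40, add_thumbnail: bool = True):
    w, h = image.size
    aspect = w / h

    # Enumerate candidate (cols, rows) grids with at most max_num tiles and
    # keep the one whose aspect ratio best matches the input image.
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_num + 1):
        for rows in range(1, max_num // cols + 1):
            diff = abs(aspect - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    cols, rows = best

    # Resize so the image maps exactly onto the chosen grid, then crop tiles.
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]

    # A global thumbnail preserves overall layout when the image is split into many tiles.
    if add_thumbnail and len(tiles) > 1:
        tiles.append(image.resize((TILE, TILE)))
    return tiles
```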
Core Capabilities
- Single and multi-image processing
- Video understanding (up to 32 segments)
- Multi-turn conversations about visual content (see the usage sketch after this list)
- OCR and text understanding in images
- Multilingual support
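Continuing from the loading sketch above, the following is a hedged example of a single-image, multi-turn conversation. The `chat()` helper is part of the model's custom remote code; its exact signature, the normalization constants, and whether an explicit image placeholder token is required can differ between releases, so treat this as an illustration rather than the canonical API.

```python
# Hedged usage sketch; assumes `model` and `tokenizer` were loaded as shown earlier.
import torch
import torchvision.transforms as T
from PIL import Image

# ImageNet-style normalization; the model's own preprocessing may differ slightly.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open("example.jpg").convert("RGB")   # hypothetical input file
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=512, do_sample=False)

# First turn: describe the image.
question = "Describe this image in detail."
response, history = model.chat(tokenizer, pixel_values, question,
                               generation_config, history=None,
                               return_history=True)

# Second turn: reuse the conversation history for a follow-up about text in the image.
follow_up = "What text appears in the image?"
response, history = model.chat(tokenizer, pixel_values, follow_up,
                               generation_config, history=history,
                               return_history=True)
print(response)
```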
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for an efficient architecture that delivers high-quality multimodal understanding on consumer hardware while maintaining strong performance on benchmarks such as DocVQA, ChartQA, and MMBench.
Q: What are the recommended use cases?
The model excels at image description, visual question-answering, OCR tasks, video understanding, and multi-turn conversations about visual content. It's particularly suitable for applications requiring efficient multimodal processing with limited computational resources.