# InternVL2_5-4B-MPO
| Property | Value |
|---|---|
| Model Size | 4B parameters |
| Vision Encoder | InternViT-300M-448px-V2_5 |
| Language Model | Qwen2.5-3B-Instruct |
| License | MIT License |
| Paper | arXiv:2411.10442 |
## What is InternVL2_5-4B-MPO?
InternVL2_5-4B-MPO is a multimodal large language model fine-tuned with Mixed Preference Optimization (MPO). It follows the "ViT-MLP-LLM" architecture: the InternViT-300M-448px-V2_5 vision encoder produces visual features, an MLP projector maps them into the language model's embedding space, and Qwen2.5-3B-Instruct handles language understanding and generation.
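The ViT-MLP-LLM data flow can be sketched in a few lines. This is a toy illustration, not the model's real code: the feature dimensions, the random inputs, and the single linear layer standing in for the MLP projector are all placeholder assumptions.

```python
import numpy as np

# Illustrative dimensions only, not the model's real sizes:
# ViT feature dim 1024, LLM hidden dim 2048.
VIT_DIM, LLM_DIM = 1024, 2048
rng = np.random.default_rng(0)

def mlp_projector(vit_feats: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the LLM embedding space.
    A single linear layer here; the real projector is a small MLP."""
    return vit_feats @ w + b

# 256 visual tokens per tile (after pixel unshuffle), projected and then
# concatenated with the text token embeddings before the LLM forward pass.
vit_tokens = rng.standard_normal((256, VIT_DIM))
w = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.01
b = np.zeros(LLM_DIM)
text_tokens = rng.standard_normal((16, LLM_DIM))

visual_embeds = mlp_projector(vit_tokens, w, b)
llm_input = np.concatenate([visual_embeds, text_tokens], axis=0)
print(llm_input.shape)  # (272, 2048): 256 visual + 16 text tokens
```

The key point is that the LLM never sees pixels, only a sequence of projected visual embeddings interleaved with text embeddings.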
## Implementation Details
The model processes images with a dynamic resolution strategy that splits each input into 448×448 pixel tiles. A pixel unshuffle operation then reduces the number of visual tokens per tile, and the model accepts single-image, multi-image, and video inputs.
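The pixel unshuffle step can be sketched directly. The sketch below is a minimal NumPy version, assuming a 32×32 patch grid per 448×448 tile (patch size 14) and an illustrative feature dimension of 64; the real model applies this to ViT feature maps inside the network.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Trade spatial resolution for channel depth:
    (H, W, C) -> (H/f, W/f, C*f*f).
    With factor 2, the 32x32 = 1024 patch tokens of a 448x448 tile
    become 16x16 = 256 tokens, a 4x reduction."""
    h, w, c = x.shape
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each 2x2 neighborhood together
    return x.reshape(h // factor, w // factor, c * factor * factor)

# A 448x448 tile with patch size 14 yields a 32x32 grid of patch features.
patch_grid = np.zeros((32, 32, 64))  # 64 = illustrative feature dim
reduced = pixel_unshuffle(patch_grid)
print(reduced.shape)  # (16, 16, 256)
```

Each output token packs the features of a 2×2 neighborhood of input patches, so no information is discarded; the sequence fed to the LLM is simply four times shorter.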
- Mixed Preference Optimization combining preference, quality, and generation losses
- Support for multi-image and video processing
- Dynamic resolution strategy for optimal image handling
- Efficient batch processing and streaming output capabilities
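The first bullet, the MPO objective, is a weighted sum of three terms. The scalar sketch below is a heavily simplified toy: the default weights, the `beta` temperature, and the `delta` reward shift are illustrative assumptions, and the real losses operate on log-probability ratios against a reference model rather than raw log-probabilities.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mpo_loss(logp_chosen: float, logp_rejected: float, logp_target: float,
             beta: float = 0.1, w_pref: float = 0.8, w_qual: float = 0.2,
             w_gen: float = 1.0, delta: float = 0.0) -> float:
    """Toy weighted sum of the three MPO loss terms (all weights illustrative)."""
    # Preference loss (DPO-style): prefer the chosen over the rejected response.
    l_pref = -math.log(sigmoid(beta * (logp_chosen - logp_rejected)))
    # Quality loss (BCO-style): score each response on its own
    # against a running reward shift `delta`.
    l_qual = -(math.log(sigmoid(beta * logp_chosen - delta))
               + math.log(sigmoid(-(beta * logp_rejected - delta)))) / 2
    # Generation loss (SFT-style): negative log-likelihood of the target.
    l_gen = -logp_target
    return w_pref * l_pref + w_qual * l_qual + w_gen * l_gen

loss = mpo_loss(logp_chosen=-5.0, logp_rejected=-9.0, logp_target=-5.0)
```

Intuitively, the preference term teaches relative ranking, the quality term calibrates absolute response quality, and the generation term keeps the model anchored to supervised targets.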
## Core Capabilities
- Strong performance across multiple benchmarks (67.6% average score)
- Advanced reasoning and visual understanding
- Multi-turn conversations with images and videos
- Batch inference and streaming output support
- Deployment options through LMDeploy
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's Mixed Preference Optimization training sets it apart: it combines preference, quality, and generation losses to improve reasoning while maintaining strong performance across multimodal tasks.
**Q: What are the recommended use cases?**
The model excels in visual-linguistic tasks including image description, multi-image comparison, video analysis, and interactive conversations about visual content. It's particularly suitable for applications requiring sophisticated reasoning about visual inputs.