# InternVL2_5-1B-MPO
| Property | Value |
|---|---|
| Model Type | Multimodal Large Language Model |
| Vision Component | InternViT-300M-448px-V2_5 |
| Language Component | Qwen2.5-0.5B-Instruct |
| License | MIT License |
| Paper | arXiv:2411.10442 |
## What is InternVL2_5-1B-MPO?
InternVL2_5-1B-MPO is a multimodal large language model that combines vision and language capabilities and is fine-tuned with Mixed Preference Optimization (MPO). It is designed to strengthen reasoning in vision-language tasks through a training approach that jointly optimizes preference learning and response generation.
## Implementation Details
The model follows a "ViT-MLP-LLM" architecture, combining InternViT for vision processing with Qwen2.5-0.5B-Instruct for language understanding. It implements a dynamic-resolution strategy based on 448×448 pixel tiles (sketched below) and supports single-image, multi-image, and video inputs.
- Utilizes Mixed Preference Optimization (MPO), which combines preference, quality, and generation losses (see the loss sketch after this list)
- Supports dynamic resolution handling for various image sizes
- Reduces the number of visual tokens through a pixel unshuffle operation
- Trained on the MMPR multimodal preference dataset of roughly 3 million samples
## Core Capabilities
- Multi-image and video processing support
- Pure text and vision-language conversations (see the usage sketch after this list)
- Streaming output generation
- Batch inference processing
- Dynamic resolution handling for various image sizes
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's uniqueness lies in its Mixed Preference Optimization approach, which combines three types of losses (preference, quality, and generation) to enhance reasoning capabilities. It also features a sophisticated data construction pipeline for training on multimodal preferences.
**Q: What are the recommended use cases?**
The model excels in vision-language tasks including image description, multi-image comparison, video analysis, and interactive conversations about visual content. It's particularly suitable for applications requiring detailed visual reasoning and natural language interaction.