# InternVL2_5-1B-MPO
| Property | Value |
|---|---|
| Model Type | Multimodal Large Language Model |
| Vision Component | InternViT-300M-448px-V2_5 |
| Language Component | Qwen2.5-0.5B-Instruct |
| License | MIT License |
| Paper | arXiv:2411.10442 |
## What is InternVL2_5-1B-MPO?
InternVL2_5-1B-MPO is a multimodal large language model that combines vision and language capabilities and is fine-tuned with Mixed Preference Optimization (MPO). It is designed to strengthen reasoning in vision-language tasks through a training approach that jointly optimizes preference learning and response generation.
## Implementation Details
The model follows a "ViT-MLP-LLM" architecture, combining InternViT for vision processing with Qwen2.5-0.5B-Instruct for language understanding. It implements a dynamic-resolution strategy based on 448×448 pixel tiles (sketched below) and supports single-image, multi-image, and video inputs.
- Utilizes Mixed Preference Optimization (MPO), which combines preference, quality, and generation losses (see the loss sketch after this list)
- Supports dynamic resolution handling for various image sizes
- Reduces the number of visual tokens through a pixel unshuffle operation
- Trained on the MMPR multimodal preference dataset of roughly 3 million samples
## Core Capabilities
- Multi-image and video processing support
- Pure text and vision-language conversations (see the usage sketch after this list)
- Streaming output generation
- Batch inference processing
- Dynamic resolution handling for various image sizes
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's uniqueness lies in its Mixed Preference Optimization approach, which combines three types of losses (preference, quality, and generation) to enhance reasoning capabilities. It also features a sophisticated data construction pipeline for training on multimodal preferences.
**Q: What are the recommended use cases?**
The model excels in vision-language tasks including image description, multi-image comparison, video analysis, and interactive conversations about visual content. It's particularly suitable for applications requiring detailed visual reasoning and natural language interaction.