# InternVL2_5-4B-MPO
| Property | Value |
|---|---|
| Model Size | 4B parameters |
| Vision Encoder | InternViT-300M-448px-V2_5 |
| Language Model | Qwen2.5-3B-Instruct |
| License | MIT License |
| Paper | arXiv:2411.10442 |
## What is InternVL2_5-4B-MPO?
InternVL2_5-4B-MPO is a multimodal large language model fine-tuned with Mixed Preference Optimization (MPO). It follows the "ViT-MLP-LLM" architecture: the InternViT-300M-448px-V2_5 vision encoder produces visual features, an MLP projector maps them into the language model's embedding space, and Qwen2.5-3B-Instruct handles language understanding and generation.
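The ViT-MLP-LLM data flow can be sketched in a few lines. This is a toy illustration, not the model's real code: the feature dimensions, the random inputs, and the single linear layer standing in for the MLP projector are all placeholder assumptions.

```python
import numpy as np

# Illustrative dimensions only, not the model's real sizes:
# ViT feature dim 1024, LLM hidden dim 2048.
VIT_DIM, LLM_DIM = 1024, 2048
rng = np.random.default_rng(0)

def mlp_projector(vit_feats: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the LLM embedding space.
    A single linear layer here; the real projector is a small MLP."""
    return vit_feats @ w + b

# 256 visual tokens per tile (after pixel unshuffle), projected and then
# concatenated with the text token embeddings before the LLM forward pass.
vit_tokens = rng.standard_normal((256, VIT_DIM))
w = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.01
b = np.zeros(LLM_DIM)
text_tokens = rng.standard_normal((16, LLM_DIM))

visual_embeds = mlp_projector(vit_tokens, w, b)
llm_input = np.concatenate([visual_embeds, text_tokens], axis=0)
print(llm_input.shape)  # (272, 2048): 256 visual + 16 text tokens
```

The key point is that the LLM never sees pixels, only a sequence of projected visual embeddings interleaved with text embeddings.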
## Implementation Details
The model processes images with a dynamic resolution strategy that splits each input into 448×448 pixel tiles. A pixel unshuffle operation then reduces the number of visual tokens per tile, and the model accepts single-image, multi-image, and video inputs.
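The pixel unshuffle step can be sketched directly. The sketch below is a minimal NumPy version, assuming a 32×32 patch grid per 448×448 tile (patch size 14) and an illustrative feature dimension of 64; the real model applies this to ViT feature maps inside the network.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Trade spatial resolution for channel depth:
    (H, W, C) -> (H/f, W/f, C*f*f).
    With factor 2, the 32x32 = 1024 patch tokens of a 448x448 tile
    become 16x16 = 256 tokens, a 4x reduction."""
    h, w, c = x.shape
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each 2x2 neighborhood together
    return x.reshape(h // factor, w // factor, c * factor * factor)

# A 448x448 tile with patch size 14 yields a 32x32 grid of patch features.
patch_grid = np.zeros((32, 32, 64))  # 64 = illustrative feature dim
reduced = pixel_unshuffle(patch_grid)
print(reduced.shape)  # (16, 16, 256)
```

Each output token packs the features of a 2×2 neighborhood of input patches, so no information is discarded; the sequence fed to the LLM is simply four times shorter.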
- Mixed Preference Optimization combining preference, quality, and generation losses
- Support for multi-image and video processing
- Dynamic resolution strategy for optimal image handling
- Efficient batch processing and streaming output capabilities
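The first bullet, the MPO objective, is a weighted sum of three terms. The scalar sketch below is a heavily simplified toy: the default weights, the `beta` temperature, and the `delta` reward shift are illustrative assumptions, and the real losses operate on log-probability ratios against a reference model rather than raw log-probabilities.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mpo_loss(logp_chosen: float, logp_rejected: float, logp_target: float,
             beta: float = 0.1, w_pref: float = 0.8, w_qual: float = 0.2,
             w_gen: float = 1.0, delta: float = 0.0) -> float:
    """Toy weighted sum of the three MPO loss terms (all weights illustrative)."""
    # Preference loss (DPO-style): prefer the chosen over the rejected response.
    l_pref = -math.log(sigmoid(beta * (logp_chosen - logp_rejected)))
    # Quality loss (BCO-style): score each response on its own
    # against a running reward shift `delta`.
    l_qual = -(math.log(sigmoid(beta * logp_chosen - delta))
               + math.log(sigmoid(-(beta * logp_rejected - delta)))) / 2
    # Generation loss (SFT-style): negative log-likelihood of the target.
    l_gen = -logp_target
    return w_pref * l_pref + w_qual * l_qual + w_gen * l_gen

loss = mpo_loss(logp_chosen=-5.0, logp_rejected=-9.0, logp_target=-5.0)
```

Intuitively, the preference term teaches relative ranking, the quality term calibrates absolute response quality, and the generation term keeps the model anchored to supervised targets.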
## Core Capabilities
- Strong performance across multiple benchmarks (67.6% average score)
- Advanced reasoning and visual understanding
- Multi-turn conversations with images and videos
- Batch inference and streaming output support
- Deployment options through LMDeploy
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's Mixed Preference Optimization training sets it apart: it combines preference, quality, and generation losses to improve reasoning while maintaining strong performance across multimodal tasks.
**Q: What are the recommended use cases?**
The model excels in visual-linguistic tasks including image description, multi-image comparison, video analysis, and interactive conversations about visual content. It's particularly suitable for applications requiring sophisticated reasoning about visual inputs.