InternVL2_5-4B-MPO

Maintained by: OpenGVLab


Property | Value
Model Size | 4B parameters
Vision Encoder | InternViT-300M-448px-V2_5
Language Model | Qwen2.5-3B-Instruct
License | MIT License
Paper | arXiv:2411.10442

What is InternVL2_5-4B-MPO?

InternVL2_5-4B-MPO is a multimodal large language model from OpenGVLab, trained with Mixed Preference Optimization (MPO) to strengthen its reasoning over images and text. It follows a "ViT-MLP-LLM" architecture: the InternViT-300M-448px-V2_5 vision encoder is connected through an MLP projector to the Qwen2.5-3B-Instruct language model.

Implementation Details

The model processes images with a dynamic resolution strategy that splits each input into 448×448 pixel tiles. A pixel unshuffle operation then reduces the number of visual tokens produced per tile before they reach the language model. Single-image, multi-image, and video inputs are all supported.

  • Mixed Preference Optimization combining preference, quality, and generation losses
  • Support for multi-image and video processing
  • Dynamic resolution strategy for optimal image handling
  • Efficient batch processing and streaming output capabilities
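
The tiling and token-reduction steps described above can be illustrated with a small sketch. This is not the model's actual preprocessing code. It assumes a ViT patch size of 14, a pixel unshuffle factor of 2 per spatial dimension, a limit of 12 tiles per image, and one extra thumbnail tile when an image is split into more than one tile; none of these values are stated in the text above. The function names are hypothetical, and ties between equally good grids are resolved in favour of fewer tiles for simplicity.

```python
from typing import List, Tuple

TILE = 448      # tile side length in pixels (from the text above)
PATCH = 14      # assumed ViT patch size
UNSHUFFLE = 2   # assumed pixel unshuffle factor per spatial dimension


def candidate_grids(min_tiles: int, max_tiles: int) -> List[Tuple[int, int]]:
    """All (cols, rows) grids with a tile count in [min_tiles, max_tiles],
    ordered from fewest to most tiles."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width: int, height: int,
              min_tiles: int = 1, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.
    Simplification: ties go to the grid with the fewest tiles."""
    aspect = width / height
    return min(candidate_grids(min_tiles, max_tiles),
               key=lambda g: abs(aspect - g[0] / g[1]))


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) inside the image after it has
    been resized to (cols * TILE, rows * TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


def tokens_per_tile() -> int:
    """Visual tokens per tile after pixel unshuffle."""
    patches_per_side = TILE // PATCH               # 448 / 14 = 32
    return (patches_per_side // UNSHUFFLE) ** 2    # 16 * 16 = 256


def visual_token_count(width: int, height: int,
                       use_thumbnail: bool = True) -> int:
    """Total visual tokens for one image under the assumptions above."""
    cols, rows = pick_grid(width, height)
    tiles = cols * rows
    if use_thumbnail and tiles > 1:
        tiles += 1  # assumed extra thumbnail tile of the whole image
    return tiles * tokens_per_tile()


if __name__ == "__main__":
    print(pick_grid(1600, 800))           # grid chosen for a 2:1 image
    print(visual_token_count(1600, 800))  # tokens for that image
```

Under these assumptions, pixel unshuffle cuts the 1024 patches of a tile down to 256 tokens, so the cost of an image grows with the number of tiles rather than with its raw pixel count.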

Core Capabilities

  • Strong performance across multiple benchmarks (67.6% average score)
  • Advanced reasoning and visual understanding
  • Multi-turn conversations with images and videos
  • Batch inference and streaming output support
  • Deployment options through LMDeploy
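
For video input, a common approach is to sample a fixed number of frames and pass them to the model as a sequence of images. The sketch below shows one way to choose those frames, assuming uniform sampling from the midpoint of equal-length segments. The function name and the sampling rule are illustrative assumptions, not a description of the model's own video pipeline.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_segments: int) -> List[int]:
    """Pick one frame index from the middle of each of `num_segments`
    equal-length segments of a video with `total_frames` frames."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    segment = total_frames / num_segments
    # Clamp to the last valid index in case of rounding at the end.
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # Four evenly spaced frames from a 100-frame clip.
    print(sample_frame_indices(100, 4))
```

Sampling a fixed number of frames keeps the visual token budget predictable regardless of clip length, at the cost of missing short events that fall between sampled frames.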

Frequently Asked Questions

Q: What makes this model unique?

The model's Mixed Preference Optimization approach sets it apart, combining three types of losses to enhance reasoning capabilities while maintaining high performance across various multimodal tasks.
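
As a rough illustration of how three such losses can be combined, the sketch below computes a weighted sum for a single preference pair. It assumes a DPO-style preference term, a BCO-style quality term with a reward shift `delta`, and a length-normalised negative log-likelihood as the generation term. The function name, the default weights, and the values of `beta` and `delta` are placeholders, not the settings used to train this model.

```python
import math
from typing import Dict


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             chosen_len: int,
             beta: float = 0.1, delta: float = 0.0,
             w_pref: float = 1.0, w_qual: float = 1.0,
             w_gen: float = 1.0) -> Dict[str, float]:
    """Weighted sum of preference, quality, and generation losses for one
    (chosen, rejected) response pair. Inputs are summed token log-probs
    under the policy and under a frozen reference model."""
    # Implicit rewards: scaled log-ratio of policy to reference.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Preference (assumed DPO-style): chosen should out-score rejected.
    preference = -log_sigmoid(reward_chosen - reward_rejected)

    # Quality (assumed BCO-style): judge each response on its own,
    # relative to the reward shift delta.
    quality = (-log_sigmoid(reward_chosen - delta)
               - log_sigmoid(-(reward_rejected - delta)))

    # Generation: length-normalised NLL of the chosen response.
    generation = -logp_chosen / chosen_len

    total = w_pref * preference + w_qual * quality + w_gen * generation
    return {"preference": preference, "quality": quality,
            "generation": generation, "total": total}


if __name__ == "__main__":
    losses = mpo_loss(logp_chosen=-8.0, logp_rejected=-12.0,
                      ref_logp_chosen=-10.0, ref_logp_rejected=-10.0,
                      chosen_len=10)
    print(losses)
```

In this formulation the preference term teaches the relative ranking of two responses, the quality term scores each response independently, and the generation term keeps the model able to produce the preferred response directly.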

Q: What are the recommended use cases?

The model excels in visual-linguistic tasks including image description, multi-image comparison, video analysis, and interactive conversations about visual content. It's particularly suitable for applications requiring sophisticated reasoning about visual inputs.
