# InternVL2_5-38B
| Property | Value |
|---|---|
| Model Size | 38B parameters |
| Vision Model | InternViT-6B-448px-V2_5 |
| Language Model | Qwen2.5-32B-Instruct |
| License | MIT License |
| Paper | arXiv:2412.05271 |
## What is InternVL2_5-38B?
InternVL2_5-38B is a state-of-the-art multimodal large language model that combines advanced vision capabilities with powerful language understanding. It represents a significant evolution in the InternVL family, featuring enhanced training strategies and improved training-data quality. The model follows the "ViT-MLP-LLM" architecture paradigm, integrating InternViT for vision processing with Qwen2.5-32B-Instruct for language tasks.
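The "ViT-MLP-LLM" pipeline can be illustrated in terms of visual token counts. The numbers below (448px tiles, 14px patches, a pixel-shuffle step that quarters the token count) follow the published InternVL design, but treat this as an illustrative sketch rather than the exact implementation:

```python
def vision_token_count(image_size=448, patch_size=14, shuffle_ratio=0.5):
    """Tokens per 448x448 image tile after the pixel-shuffle step,
    before the MLP projects them into the LLM's embedding space."""
    patches_per_side = image_size // patch_size   # 448 / 14 = 32
    num_patches = patches_per_side ** 2           # 32 * 32 = 1024 ViT patches
    # pixel shuffle with ratio 0.5 merges 2x2 patch groups -> 1/4 the tokens
    return int(num_patches * shuffle_ratio ** 2)

print(vision_token_count())  # 256 tokens per tile
```

Each tile thus contributes a compact, fixed-length sequence of visual tokens that the MLP connector maps into the language model's input space.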
## Implementation Details
The model follows a three-stage training pipeline: MLP Warmup, optional ViT Incremental Learning, and Full Model Instruction Tuning. It uses dynamic high-resolution training to handle multiple image formats and video data, together with data-filtering mechanisms that ensure high-quality training samples.
- Progressive scaling strategy requiring only 120B tokens compared to competitors' 1.4T tokens
- Dynamic resolution handling supporting single images, multiple images, and video content
- Advanced data filtering pipeline with LLM-based quality scoring and repetition detection
- Random JPEG compression for enhanced robustness to image quality variations
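The dynamic-resolution idea can be sketched as choosing a grid of fixed-size tiles whose aspect ratio best matches the input image. This simplified helper is an assumption-laden illustration of that selection step, not the model's preprocessing code:

```python
def pick_tile_grid(width, height, max_tiles=12, tile=448):
    """Choose a (cols, rows) grid of 448px tiles whose overall aspect
    ratio best matches the input image (simplified sketch)."""
    aspect = width / height
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                         for r in range(1, max_tiles + 1)
                         if c * r <= max_tiles]
    # pick the grid whose aspect ratio is closest to the image's
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))

print(pick_tile_grid(1344, 448))  # a wide 3:1 image -> (3, 1) grid
print(pick_tile_grid(448, 448))   # a square image   -> (1, 1) grid
```

The image would then be resized to `cols * 448` by `rows * 448` and split into tiles, so images of very different shapes all map onto the same fixed-resolution vision encoder.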
## Core Capabilities
- Multi-modal reasoning and mathematics comprehension
- OCR and document understanding
- Multi-image and video analysis
- Visual grounding and multilingual understanding
- Improved pure language capabilities compared to previous versions
## Frequently Asked Questions
**Q: What makes this model unique?**
InternVL2_5-38B stands out for its efficient training approach and superior multimodal capabilities while maintaining strong language performance. Its progressive scaling strategy and advanced data filtering pipeline ensure high-quality outputs with significantly less training data than competitors.
**Q: What are the recommended use cases?**
The model excels in complex visual-linguistic tasks including document analysis, multi-image comparison, video understanding, and mathematical reasoning. It's particularly suitable for applications requiring sophisticated multimodal comprehension and generation capabilities.
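For multi-image tasks, chat-style interfaces to models like this conventionally mark each image slot with a placeholder token in the prompt. The helper below is a hypothetical sketch of that prompt construction; the placeholder name and numbering scheme are assumptions, not the model's official API:

```python
def build_multi_image_prompt(question, num_images):
    """Prefix a question with one numbered <image> placeholder per
    input image (hypothetical helper, not the official API)."""
    slots = "\n".join(f"Image-{i + 1}: <image>" for i in range(num_images))
    return f"{slots}\n{question}"

print(build_multi_image_prompt("Which chart shows higher revenue?", 2))
```

At inference time, each placeholder would be replaced by the visual tokens of the corresponding preprocessed image before the combined sequence is passed to the language model.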