InternVL2-1B

OpenGVLab

A 938M-parameter multimodal LLM that pairs the InternViT-300M-448px vision encoder with the Qwen2-0.5B-Instruct language model for general visual-language tasks.

Property	Value
Parameter Count	938M
License	MIT
Paper	InternVL Paper
Architecture	InternViT-300M-448px + Qwen2-0.5B-Instruct

What is InternVL2-1B?

InternVL2-1B is a compact multimodal large language model that accepts both visual and text input. It is part of the InternVL 2.0 series, and its 938M-parameter architecture integrates InternViT-300M-448px for vision processing with Qwen2-0.5B-Instruct for language understanding.

Implementation Details

The model uses an 8k context window and is designed to handle several types of input, including long texts, multiple images, and videos. It runs in BF16 precision and supports deployment options ranging from 16-bit inference to 4-bit quantization.
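To make the 8k context window concrete, the sketch below estimates how many 448px image tiles fit alongside a text prompt. It is a rough budget calculation, not the model's actual preprocessing. The figures are assumptions that do not come from this page: "8k" is read as 8192 tokens, each tile is assumed to cost 256 visual tokens, and the function name max_image_tiles is hypothetical.

```python
# Assumptions (not stated on this page): "8k" means 8192 tokens, and each
# 448x448 tile is represented by 256 visual tokens.
CONTEXT_TOKENS = 8192
TOKENS_PER_TILE = 256


def max_image_tiles(text_tokens: int,
                    context_tokens: int = CONTEXT_TOKENS,
                    tokens_per_tile: int = TOKENS_PER_TILE) -> int:
    """Return how many image tiles fit in the context next to the text."""
    if text_tokens < 0 or context_tokens <= 0 or tokens_per_tile <= 0:
        raise ValueError("token counts must be non-negative and sizes positive")
    # Whatever the text does not use is available for visual tokens.
    remaining = max(0, context_tokens - text_tokens)
    return remaining // tokens_per_tile


if __name__ == "__main__":
    # Budget left for images when prompt plus expected answer use 1024 tokens.
    print(max_image_tiles(text_tokens=1024))
```

A budget like this is mainly useful when mixing several images or many video frames in one prompt, since every extra tile reduces the room left for text.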

Supports both single and multi-image processing
Capable of handling video inputs with frame extraction
Implements efficient attention mechanisms with flash attention support
Offers flexible deployment options including multi-GPU distribution
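The frame extraction mentioned above can be illustrated with a uniform sampling scheme: split the video into equal segments and take the middle frame of each. This is a minimal sketch of one common approach, and it may differ from what the model's own video pipeline does. The function name sample_frame_indices and the segment count used in the example are hypothetical.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_segments: int) -> List[int]:
    """Pick one frame index from the middle of each equal-length segment."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    segment_length = total_frames / num_segments
    indices = []
    for i in range(num_segments):
        # Midpoint of segment i, clamped to the last valid frame index.
        midpoint = int(segment_length * i + segment_length / 2)
        indices.append(min(total_frames - 1, midpoint))
    return indices


if __name__ == "__main__":
    # Indices of 8 frames sampled from a 240-frame clip.
    print(sample_frame_indices(total_frames=240, num_segments=8))
```

Each selected frame would then be decoded and passed to the model like any other image, so the number of segments directly controls how much of the context window the video consumes.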

Core Capabilities

Document and chart comprehension (81.7% on DocVQA test)
Scene text understanding and OCR tasks (754 on OCRBench)
Multi-image and video analysis
Cultural understanding and integrated multimodal reasoning
Strong performance on MME benchmark (1794.4 sum score)

Frequently Asked Questions

Q: What makes this model unique?

InternVL2-1B stands out for delivering competitive results at a small size. With only 938M parameters, it scores 81.7% on the DocVQA test set and 754 on OCRBench.

Q: What are the recommended use cases?

The model is best suited to document analysis, image understanding, video comprehension, and multimodal reasoning. Its small footprint makes it a good fit for applications where efficient deployment matters.