Mono-InternVL-2B
| Property | Value |
|---|---|
| Parameter Count | 3.11B total (1.8B activated) |
| Model Type | Monolithic Multimodal LLM |
| License | MIT |
| Paper | View Paper |
| Base Model | InternLM2-chat-1.8B |
What is Mono-InternVL-2B?
Mono-InternVL-2B is a monolithic multimodal large language model that integrates visual encoding and textual decoding into a single LLM architecture. Through a mixture-of-experts (MoE) mechanism, it embeds visual experts into a pre-trained language model while preserving the original language capabilities by keeping the pre-trained language parameters frozen.
Implementation Details
The model is trained with an Endogenous Visual Pretraining (EViP) approach that learns visual representations in a coarse-to-fine manner. With 1.8B of its 3.11B parameters activated per forward pass, it delivers strong benchmark results while reducing first-token latency by up to 67% compared with comparable modular MLLMs.
- Built on InternLM2-chat-1.8B architecture
- Supports both pure text and image-text conversations
- Ships weights in BF16 for efficient inference
- Implements dynamic image preprocessing to handle varied aspect ratios (see the inference sketch below)
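The snippet below is a minimal inference sketch, not an official example. It assumes the model is published on Hugging Face under the repo id `OpenGVLab/Mono-InternVL-2B` and exposes the InternVL-style `chat()` helper via `trust_remote_code`; the single-tile 448×448 preprocessing shown here is a simplification of the dynamic-tiling pipeline mentioned above.

```python
# Minimal inference sketch. The repo id and the chat() interface follow the
# InternVL family's published usage pattern and are assumptions here.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/Mono-InternVL-2B"  # assumed Hugging Face repo id

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # weights are shipped in BF16
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing (the released code additionally tiles
# the image dynamically to cover varied aspect ratios).
preprocess = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=256, do_sample=False)

# Image-text conversation
answer = model.chat(tokenizer, pixel_values, "<image>\nDescribe this image.", generation_config)
print(answer)

# Pure-text conversation (no pixel_values)
answer = model.chat(tokenizer, None, "Give me a two-line poem about autumn.", generation_config)
print(answer)
```

Passing `None` in place of `pixel_values` is how the InternVL-style `chat()` helper handles text-only turns, which corresponds to the mixed pure-text and image-text conversation support noted above.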
Core Capabilities
- Multimodal understanding and generation across a broad range of benchmarks
- Strong visual question answering, with a 70.1% average VQA score
- Strong OCR, scoring 767 on OCRBench
- Solid visual-linguistic understanding, with 65.5% on MMBench-EN
Frequently Asked Questions
Q: What makes this model unique?
Its monolithic architecture integrates vision and language capabilities into a single model, unlike traditional modular approaches. The use of visual experts through MoE allows for efficient visual processing while maintaining strong language abilities.
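To make the visual-expert idea concrete, here is a schematic PyTorch sketch rather than the released implementation: tokens are routed by a modality mask, visual tokens pass through a trainable visual-expert FFN, and text tokens keep using the original FFN, which stays frozen to preserve language ability. The class name, layer sizes, and mask-based routing are illustrative assumptions.

```python
# Schematic sketch of a modality-routed MoE feed-forward block.
# Names, sizes, and the mask-based routing are illustrative assumptions,
# not the released Mono-InternVL code.
import torch
import torch.nn as nn


class ModalityRoutedFFN(nn.Module):
    def __init__(self, hidden: int = 2048, inner: int = 8192):
        super().__init__()
        # The original pre-trained FFN acts as the text expert; it stays frozen.
        self.text_expert = nn.Sequential(
            nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden)
        )
        # A newly added visual expert is the trainable path for visual tokens.
        self.visual_expert = nn.Sequential(
            nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden)
        )
        for p in self.text_expert.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); is_visual: (batch, seq) boolean modality mask.
        out = torch.empty_like(x)
        out[is_visual] = self.visual_expert(x[is_visual])
        out[~is_visual] = self.text_expert(x[~is_visual])
        return out


ffn = ModalityRoutedFFN()
tokens = torch.randn(1, 6, 2048)                       # 3 visual + 3 text tokens
mask = torch.tensor([[True, True, True, False, False, False]])
output = ffn(tokens, mask)                             # (1, 6, 2048)
```

Because only the visual-expert parameters receive gradients, visual pretraining cannot degrade the frozen language pathway, which is the point of the parameter-freezing strategy described above.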
Q: What are the recommended use cases?
The model excels in image-text tasks including visual question answering, OCR, chart understanding, and general visual-linguistic tasks. It's particularly effective for applications requiring both visual understanding and natural language generation.