Mono-InternVL-2B
| Property | Value |
|---|---|
| Parameter Count | 3.11B total (1.8B activated) |
| Model Type | Monolithic Multimodal LLM |
| License | MIT |
| Paper | View Paper |
| Base Model | InternLM2-chat-1.8B |
What is Mono-InternVL-2B?
Mono-InternVL-2B is a monolithic multimodal large language model that integrates visual encoding and textual decoding into a single LLM architecture. Through a mixture-of-experts (MoE) mechanism, it embeds visual experts into a pre-trained language model while preserving the original language capabilities by keeping the pre-trained language parameters frozen.
Implementation Details
The model is trained with an Endogenous Visual Pretraining (EViP) approach that learns visual representations in a coarse-to-fine manner. With 1.8B of its 3.11B parameters activated per forward pass, it delivers strong benchmark results while reducing first-token latency by up to 67% compared with comparable modular MLLMs.
- Built on InternLM2-chat-1.8B architecture
- Supports both pure text and image-text conversations
- Ships weights in BF16 for efficient inference
- Implements dynamic image preprocessing to handle varied aspect ratios (see the inference sketch below)
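The snippet below is a minimal inference sketch, not an official example. It assumes the model is published on Hugging Face under the repo id `OpenGVLab/Mono-InternVL-2B` and exposes the InternVL-style `chat()` helper via `trust_remote_code`; the single-tile 448×448 preprocessing shown here is a simplification of the dynamic-tiling pipeline mentioned above.

```python
# Minimal inference sketch. The repo id and the chat() interface follow the
# InternVL family's published usage pattern and are assumptions here.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/Mono-InternVL-2B"  # assumed Hugging Face repo id

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # weights are shipped in BF16
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing (the released code additionally tiles
# the image dynamically to cover varied aspect ratios).
preprocess = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=256, do_sample=False)

# Image-text conversation
answer = model.chat(tokenizer, pixel_values, "<image>\nDescribe this image.", generation_config)
print(answer)

# Pure-text conversation (no pixel_values)
answer = model.chat(tokenizer, None, "Give me a two-line poem about autumn.", generation_config)
print(answer)
```

Passing `None` in place of `pixel_values` is how the InternVL-style `chat()` helper handles text-only turns, which corresponds to the mixed pure-text and image-text conversation support noted above.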
Core Capabilities
- Multimodal understanding and generation across a broad range of benchmarks
- Strong visual question answering, with a 70.1% average VQA score
- Strong OCR, scoring 767 on OCRBench
- Solid visual-linguistic understanding, with 65.5% on MMBench-EN
Frequently Asked Questions
Q: What makes this model unique?
Its monolithic architecture integrates vision and language capabilities into a single model, unlike traditional modular approaches. The use of visual experts through MoE allows for efficient visual processing while maintaining strong language abilities.
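To make the visual-expert idea concrete, here is a schematic PyTorch sketch rather than the released implementation: tokens are routed by a modality mask, visual tokens pass through a trainable visual-expert FFN, and text tokens keep using the original FFN, which stays frozen to preserve language ability. The class name, layer sizes, and mask-based routing are illustrative assumptions.

```python
# Schematic sketch of a modality-routed MoE feed-forward block.
# Names, sizes, and the mask-based routing are illustrative assumptions,
# not the released Mono-InternVL code.
import torch
import torch.nn as nn


class ModalityRoutedFFN(nn.Module):
    def __init__(self, hidden: int = 2048, inner: int = 8192):
        super().__init__()
        # The original pre-trained FFN acts as the text expert; it stays frozen.
        self.text_expert = nn.Sequential(
            nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden)
        )
        # A newly added visual expert is the trainable path for visual tokens.
        self.visual_expert = nn.Sequential(
            nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden)
        )
        for p in self.text_expert.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); is_visual: (batch, seq) boolean modality mask.
        out = torch.empty_like(x)
        out[is_visual] = self.visual_expert(x[is_visual])
        out[~is_visual] = self.text_expert(x[~is_visual])
        return out


ffn = ModalityRoutedFFN()
tokens = torch.randn(1, 6, 2048)                       # 3 visual + 3 text tokens
mask = torch.tensor([[True, True, True, False, False, False]])
output = ffn(tokens, mask)                             # (1, 6, 2048)
```

Because only the visual-expert parameters receive gradients, visual pretraining cannot degrade the frozen language pathway, which is the point of the parameter-freezing strategy described above.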
Q: What are the recommended use cases?
The model excels in image-text tasks including visual question answering, OCR, chart understanding, and general visual-linguistic tasks. It's particularly effective for applications requiring both visual understanding and natural language generation.