Mono-InternVL-2B

Maintained By
OpenGVLab

Property         Value
Parameter Count  3.11B (1.8B activated)
Model Type       Monolithic Multimodal LLM
License          MIT
Paper            View Paper
Base Model       InternLM2-chat-1.8B

What is Mono-InternVL-2B?

Mono-InternVL-2B is a monolithic multimodal large language model: visual encoding and textual decoding are integrated into a single LLM architecture. It embeds visual experts into a pre-trained language model through a mixture-of-experts (MoE) mechanism, and it freezes the original language-model parameters so that the pre-trained language capabilities are preserved.
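The idea of visual experts sitting next to a frozen language model can be illustrated with a toy routing function. This is a minimal sketch, not the model's actual code: it assumes each token is sent to an expert according to its modality (the description above does not spell out the routing rule), and the names `Expert` and `route_by_modality` and the toy expert functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Stand-in for a feed-forward expert: a function plus a trainable flag."""
    name: str
    fn: Callable[[Vector], Vector]
    trainable: bool


def route_by_modality(
    tokens: Sequence[Vector],
    is_visual: Sequence[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert that matches its modality (assumed rule)."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        (visual_expert if visual else text_expert).fn(tok)
        for tok, visual in zip(tokens, is_visual)
    ]


# Toy experts: the text expert is frozen, the visual expert is trainable.
text_expert = Expert("text_ffn", lambda v: [2.0 * x for x in v], trainable=False)
visual_expert = Expert("visual_ffn", lambda v: [x + 1.0 for x in v], trainable=True)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, False, True]  # image patch, text token, image patch

outputs = route_by_modality(tokens, is_visual, text_expert, visual_expert)
print(outputs)  # [[2.0, 3.0], [6.0, 8.0], [6.0, 7.0]]
```

In this sketch the text path is never updated (`trainable=False`), which mirrors the parameter-freezing idea: only the visual expert would receive gradient updates during visual pretraining.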

Implementation Details

The model is trained with Endogenous Visual Pretraining (EViP), a coarse-to-fine approach to visual learning. Of its 3.11B total parameters, 1.8B are activated, and it is reported to reduce first-token latency by up to 67%, which keeps deployment efficient.
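The parameter counts above translate into a few quick figures. The arithmetic below is a rough sketch: it uses only the two counts quoted on this page, and it assumes weights are stored at 2 bytes per parameter (as with BF16) and ignores activations, caches, and framework overhead.

```python
TOTAL_PARAMS = 3.11e9      # total parameters, from the model card
ACTIVATED_PARAMS = 1.8e9   # activated parameters, from the model card
BYTES_PER_PARAM = 2        # assumption: 16-bit (BF16) weights

activated_fraction = ACTIVATED_PARAMS / TOTAL_PARAMS
inactive_params = TOTAL_PARAMS - ACTIVATED_PARAMS
weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM

print(f"activated fraction: {activated_fraction:.1%}")      # 57.9%
print(f"not activated: {inactive_params / 1e9:.2f}B")       # 1.31B
print(f"weight storage: {weight_bytes / 1e9:.2f} GB")       # 6.22 GB
```

So roughly 58% of the weights take part in a forward pass, while all 3.11B still have to be held in memory.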

  • Built on the InternLM2-chat-1.8B base model
  • Supports both pure-text and image-text conversations
  • Uses the BF16 tensor type
  • Implements dynamic image preprocessing for various aspect ratios
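Dynamic preprocessing for different aspect ratios is commonly done by splitting an image into a grid of fixed-size tiles. The sketch below shows one way to pick such a grid. It is an illustration under stated assumptions, not the model's own preprocessing code: the 448-pixel tile size, the 12-tile limit, the tie-breaking rule (fewest tiles wins), and the function names `candidate_grids`, `choose_grid`, and `tiled_size` are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def candidate_grids(max_tiles: int) -> List[Tuple[int, int]]:
    """All (cols, rows) grids with at most max_tiles tiles, fewest tiles first."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return sorted(grids, key=lambda g: g[0] * g[1])


def choose_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    Ties go to the grid with the fewest tiles, because min() keeps the
    first best candidate and the candidates are sorted by tile count.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    image_ratio = width / height
    return min(
        candidate_grids(max_tiles),
        key=lambda g: abs(image_ratio - g[0] / g[1]),
    )


def tiled_size(
    width: int, height: int, tile_size: int = 448, max_tiles: int = 12
) -> Tuple[int, int]:
    """Size to resize the image to before cutting it into square tiles."""
    cols, rows = choose_grid(width, height, max_tiles)
    return cols * tile_size, rows * tile_size


print(choose_grid(1600, 800))   # (2, 1): a wide image gets two tiles side by side
print(choose_grid(600, 1800))   # (1, 3): a tall image gets three stacked tiles
print(tiled_size(1600, 800))    # (896, 448)
```

Choosing the grid by aspect ratio keeps distortion low when the image is resized, at the cost of more tiles (and therefore more visual tokens) for elongated images.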

Core Capabilities

  • Multimodal understanding and generation across various benchmarks
  • Strong performance in visual question answering (70.1% average VQA score)
  • Strong OCR capability, with a score of 767 on OCRBench
  • Effective visual-linguistic understanding with 65.5% on MMBench-EN

Frequently Asked Questions

Q: What makes this model unique?

Its monolithic architecture integrates vision and language capabilities into a single model, unlike modular approaches that pair a separate vision encoder with a language model. Adding visual experts through MoE allows efficient visual processing while the frozen language parameters retain the base model's language abilities.

Q: What are the recommended use cases?

The model excels in image-text tasks including visual question answering, OCR, chart understanding, and general visual-linguistic tasks. It's particularly effective for applications requiring both visual understanding and natural language generation.
