Mono-InternVL-2B

OpenGVLab

Mono-InternVL-2B is a monolithic multimodal LLM with 1.8B active parameters that integrates vision and text capabilities through a mixture-of-experts mechanism. It is built on InternLM2.

Property          Value
Parameter Count   3.11B (1.8B activated)
Model Type        Monolithic Multimodal LLM
License           MIT
Paper             View Paper
Base Model        InternLM2-chat-1.8B

What is Mono-InternVL-2B?

Mono-InternVL-2B is a monolithic multimodal large language model: it integrates visual encoding and textual decoding into a single LLM rather than pairing a language model with a separate vision module. Using a mixture-of-experts (MoE) mechanism, it embeds visual experts into a pre-trained language model and freezes that model's original parameters to preserve its language capabilities.
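
The idea of visual experts sitting beside frozen language parameters can be illustrated with a toy example. This is a minimal sketch, not the model's actual code: it assumes tokens are routed by modality (visual tokens to a visual expert, text tokens to the original text expert), and the names `Expert` and `route_by_modality` and the toy arithmetic standing in for feed-forward layers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Stand-in for a feed-forward expert: a function plus a trainable flag."""
    name: str
    fn: Callable[[Vector], Vector]
    trainable: bool


def route_by_modality(
    tokens: Sequence[Vector],
    is_visual: Sequence[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert that matches its modality (assumed routing rule)."""
    outputs = []
    for token, visual in zip(tokens, is_visual):
        expert = visual_expert if visual else text_expert
        outputs.append(expert.fn(token))
    return outputs


# The text expert plays the role of the pre-trained, frozen language weights;
# the visual expert is the newly added, trainable part.
text_expert = Expert("text_ffn", lambda v: [2.0 * x for x in v], trainable=False)
visual_expert = Expert("visual_ffn", lambda v: [x + 1.0 for x in v], trainable=True)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, False, True]  # image patch, text token, image patch

mixed = route_by_modality(tokens, is_visual, text_expert, visual_expert)
trainable_names = [e.name for e in (text_expert, visual_expert) if e.trainable]
print(mixed)
print(trainable_names)
```

Because text tokens only ever pass through the frozen expert in this sketch, training the visual expert cannot change how pure-text input is processed, which is the intuition behind preserving language capabilities through parameter freezing.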

Implementation Details

The model uses Endogenous Visual Pretraining (EViP), a coarse-to-fine visual learning approach. Only 1.8B of its 3.11B parameters are activated, which keeps deployment efficient: the authors report first-token latency reduced by up to 67% compared with the modular InternVL-1.5.
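
As a rough back-of-the-envelope check on those parameter counts, the sketch below computes the activated fraction and the weight storage size. It assumes every weight is stored in BF16 (2 bytes per parameter) and ignores activations, the KV cache, and runtime overhead, so the figures are lower bounds rather than measured memory usage.

```python
TOTAL_PARAMS = 3.11e9    # total parameters, from the property table
ACTIVE_PARAMS = 1.8e9    # parameters activated per token
BYTES_PER_BF16 = 2       # BF16 uses 16 bits per value

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
weight_bytes = TOTAL_PARAMS * BYTES_PER_BF16
weight_gib = weight_bytes / 2**30

print(f"active fraction: {active_fraction:.1%}")
print(f"BF16 weights: {weight_bytes / 1e9:.2f} GB ({weight_gib:.2f} GiB)")
```

Under these assumptions, a little under 58% of the parameters are activated per token, and the weights alone occupy about 6.2 GB.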

  • Built on the InternLM2-chat-1.8B architecture
  • Supports both pure text and image-text conversations
  • Uses the BF16 tensor type for its weights
  • Implements dynamic image preprocessing for various aspect ratios

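Dynamic image preprocessing for various aspect ratios is commonly done by splitting an image into a grid of fixed-size tiles whose shape best matches the image. The following is a minimal sketch of that idea, not the model's own preprocessing code: the tile size of 448, the cap of 12 tiles, the tie-breaking rule (prefer fewer tiles), and the function names `choose_grid` and `tile_boxes` are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = Tuple[int, int]            # (columns, rows)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def choose_grid(width: int, height: int,
                min_tiles: int = 1, max_tiles: int = 12) -> Grid:
    """Pick the (columns, rows) grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; on a tie, prefer the grid with fewer tiles.
    return min(grids, key=lambda g: (abs(aspect - g[0] / g[1]), g[0] * g[1]))


def tile_boxes(cols: int, rows: int, tile_size: int = 448) -> List[Box]:
    """Crop boxes for an image already resized to (cols*tile_size, rows*tile_size)."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


landscape = choose_grid(800, 600)    # 4:3 image
wide = choose_grid(1000, 500)        # 2:1 image
print(landscape, len(tile_boxes(*landscape)))
print(wide, tile_boxes(*wide))
```

In this sketch the image would be resized to the chosen grid's pixel dimensions before cropping, so every tile has the same fixed size regardless of the original aspect ratio.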
Core Capabilities

  • Multimodal understanding and generation across various benchmarks
  • Strong performance in visual question answering (70.1% average VQA score)
  • Strong OCR capabilities, with a score of 767 on OCRBench
  • Effective visual-linguistic understanding with 65.5% on MMBench-EN

Frequently Asked Questions

Q: What makes this model unique?

Its monolithic architecture integrates vision and language capabilities into a single model, unlike traditional modular approaches. The use of visual experts through MoE allows for efficient visual processing while maintaining strong language abilities.

Q: What are the recommended use cases?

The model excels in image-text tasks including visual question answering, OCR, chart understanding, and general visual-linguistic tasks. It's particularly effective for applications requiring both visual understanding and natural language generation.
