InternVL2-Llama3-76B
| Property | Value |
|---|---|
| Parameter Count | 76.3B |
| License | Llama3 Community License |
| Paper | arXiv:2404.16821 |
| Architecture | InternViT-6B + MLP projector + Llama3-70B |
What is InternVL2-Llama3-76B?
InternVL2-Llama3-76B is a state-of-the-art multimodal large language model that combines InternViT-6B-448px-V1-5 for vision processing with Hermes-2-Theta-Llama-3-70B for language understanding. It represents the largest model in the InternVL 2.0 series, trained with an 8k context window to handle complex visual-linguistic tasks.
Implementation Details
The model architecture consists of three main components: a vision encoder (InternViT-6B), an MLP projector for multimodal fusion, and a language model (Llama3-70B). It can be deployed in BF16 precision or with 8-bit quantization; 4-bit quantization is not recommended due to performance degradation.
- Trained with 8k context window for enhanced long-form understanding
- Supports multiple input formats including images, documents, and videos
- Implements efficient multi-GPU inference with customizable device mapping (see the loading sketch below)
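A minimal loading sketch under those constraints follows. It assumes the standard Hugging Face repo id `OpenGVLab/InternVL2-Llama3-76B` and generic `device_map="auto"` placement; the official model card additionally ships a custom `split_model()` helper for hand-tuned device mapping, so treat this as a starting point rather than the reference recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-Llama3-76B"

# BF16 weights sharded across all visible GPUs. device_map="auto" lets
# accelerate place the vision encoder, MLP projector, and language-model
# layers automatically; the model card's split_model() helper gives
# finer manual control.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # InternVL's modeling code lives in the repo
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# 8-bit variant (4-bit is discouraged above because of quality loss):
# model = AutoModel.from_pretrained(path, load_in_8bit=True,
#                                   trust_remote_code=True,
#                                   device_map="auto").eval()
```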
Core Capabilities
- Document and Chart Understanding: 94.1% on DocVQA, 88.4% on ChartQA
- Visual Question Answering: Strong performance across MMBench (86.5%) and MME benchmarks
- Scene Text Understanding: 839 (out of 1,000) on OCRBench
- Video Analysis: Supports up to 16-frame video processing with competitive performance (see the frame-sampling sketch below)
- Visual Grounding: 90.0% average accuracy across RefCOCO benchmarks
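As a rough illustration of the 16-frame budget mentioned above, the sketch below uniformly samples frame indices from a clip. `sample_frame_indices` is a hypothetical helper name; the actual video preprocessing (448px tiling, normalization) is defined in the model's own code.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_segments: int = 16) -> np.ndarray:
    """Pick the middle frame of each of num_segments equal temporal bins --
    the common uniform-sampling scheme for feeding clips to a multimodal LLM."""
    seg = total_frames / num_segments
    indices = np.array([int(seg * (i + 0.5)) for i in range(num_segments)])
    return np.clip(indices, 0, total_frames - 1)

# A 300-frame clip yields indices [9, 28, 46, ..., 290]:
print(sample_frame_indices(300))
```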
Frequently Asked Questions
Q: What makes this model unique?
The model uniquely combines one of the largest vision transformers (InternViT-6B) with Llama3, achieving performance competitive with commercial models while remaining open for research use. Its 8k context window and multi-frame video capabilities set it apart from many other open-source multimodal models.
Q: What are the recommended use cases?
The model excels in complex visual-linguistic tasks including document analysis, chart interpretation, visual question answering, and video understanding. It's particularly suitable for applications requiring sophisticated understanding of both visual and textual content, such as document processing systems, educational tools, and content analysis platforms.
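For a concrete sense of the call flow, here is a hedged single-image inference sketch. It reuses `model` and `tokenizer` from the loading example above, squeezes the image into one 448×448 tile (the full pipeline dynamically tiles large documents into multiple crops), and calls the `chat()` method exposed by the repo's `trust_remote_code` modeling file; the file name `invoice.png` and the exact generation settings are placeholders.

```python
import torch
from PIL import Image
from torchvision import transforms as T

# Single-tile preprocessing. Mean/std are the standard ImageNet statistics
# used by InternViT; the full pipeline also tiles large pages into several
# 448x448 crops before stacking them.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

image = Image.open("invoice.png").convert("RGB")  # placeholder input file
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

# chat() is the conversational entry point defined by the repo's remote code;
# the <image> token marks where the visual tokens are spliced into the prompt.
question = "<image>\nSummarize the line items in this invoice."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```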