InternVL2-Llama3-76B

Maintained By
OpenGVLab

Property         Value
Parameter Count  76.3B
License          Llama3 Community License
Paper            arXiv:2404.16821
Architecture     InternViT-6B + Llama3-70B

What is InternVL2-Llama3-76B?

InternVL2-Llama3-76B is a state-of-the-art multimodal large language model that combines InternViT-6B-448px-V1-5 for vision processing with Hermes-2-Theta-Llama-3-70B for language understanding. It represents the largest model in the InternVL 2.0 series, trained with an 8k context window to handle complex visual-linguistic tasks.

Implementation Details

The model architecture consists of three main components: a vision encoder (InternViT-6B), an MLP projector for multimodal fusion, and a language model (Hermes-2-Theta-Llama-3-70B). It can be run in BF16 precision or with 8-bit quantization for efficient deployment; 4-bit quantization is not recommended due to performance degradation.

  • Trained with 8k context window for enhanced long-form understanding
  • Supports multiple input formats including images, documents, and videos
  • Implements efficient multi-GPU inference with customizable device mapping (see the loading sketch after this list)
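
As a rough illustration, here is a minimal loading sketch in the style of the Hugging Face usage examples. The repository path follows the official OpenGVLab namespace; `device_map="auto"` is one way to shard the weights across GPUs (the official model card also shows a hand-written device map for finer control).

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-Llama3-76B"

# BF16 weights sharded across available GPUs; pass load_in_8bit=True for
# 8-bit quantization instead. 4-bit is not recommended for this model.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # pulls in the InternVL chat modeling code
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```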

Core Capabilities

  • Document and Chart Understanding: 94.1% on DocVQA, 88.4% on ChartQA
  • Visual Question Answering: Strong performance on MMBench (86.5%) and MME (see the inference sketch after this list)
  • Scene Text Understanding: Score of 839 on OCRBench
  • Video Analysis: Supports up to 16-frame video processing with competitive performance
  • Visual Grounding: 90.0% average accuracy across RefCOCO benchmarks
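
To make the single-image workflow concrete, here is a hedged sketch of the chat API documented on the model card (`model.chat` comes from the repository's `trust_remote_code` modeling files), reusing the model and tokenizer loaded above. The single-tile 448x448 preprocessing below is a simplification of the card's `load_image` helper, which additionally performs dynamic tiling for high-resolution inputs; the file name `chart.png` is a placeholder.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Simplified single-tile preprocessing; the official load_image helper also
# splits high-resolution images into multiple 448x448 tiles.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
pixel_values = transform(Image.open("chart.png")).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=512, do_sample=False)
question = "<image>\nWhat trend does this chart show?"  # <image> marks where the tiles go
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```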

Frequently Asked Questions

Q: What makes this model unique?

The model uniquely combines one of the largest vision transformers (InternViT-6B) with Llama3, achieving performance competitive with commercial models while remaining open for research use. Its 8k context window and multi-frame video capabilities set it apart from many other open-source multimodal models.

Q: What are the recommended use cases?

The model excels in complex visual-linguistic tasks including document analysis, chart interpretation, visual question answering, and video understanding. It's particularly suitable for applications requiring sophisticated understanding of both visual and textual content, such as document processing systems, educational tools, and content analysis platforms.
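
For the video use case, a hedged multi-frame sketch in the spirit of the model card's video example: sampled frames are preprocessed like single images, stacked along the batch dimension, and referenced in the prompt as `Frame1: <image>`, `Frame2: <image>`, and so on. The frame file names are placeholders, and `num_patches_list` tells `model.chat` how many tiles belong to each frame.

```python
# Reuses transform, model, tokenizer, and generation_config from the sketches above.
frames = [transform(Image.open(f"frame_{i}.jpg")) for i in range(8)]  # placeholder files
pixel_values = torch.stack(frames).to(torch.bfloat16).cuda()
num_patches_list = [1] * len(frames)  # one 448x448 tile per frame in this simplified setup

prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(len(frames)))
question = prefix + "Describe what happens in this clip."
response = model.chat(
    tokenizer, pixel_values, question, generation_config,
    num_patches_list=num_patches_list,
)
print(response)
```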
