LLaVA-Llama-3-8B-v1.1 Transformers
| Property | Value |
|---|---|
| Base Model | Meta-Llama-3-8B-Instruct |
| Visual Encoder | CLIP-ViT-Large-patch14-336 |
| Fine-tuning Strategy | Full LLM, LoRA ViT |
| Repository | xtuner/llava-llama-3-8b-v1_1-transformers |
What is llava-llama-3-8b-v1_1-transformers?
llava-llama-3-8b-v1_1-transformers is a vision-language model that pairs Meta's Llama 3 8B Instruct language model with a CLIP visual encoder in the LLaVA architecture. Trained on the ShareGPT4V-PT and InternVL-SFT datasets, it is built for multimodal tasks and is particularly effective at understanding and describing images while retaining strong language capabilities.
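Because this repository ships the weights in the Hugging Face LLaVA format, the model can be loaded directly with the transformers library. The snippet below is a minimal inference sketch, assuming a recent transformers release with LLaVA support; the prompt template, image URL, and generation settings are illustrative rather than taken from the model card.

```python
# Minimal inference sketch (assumes a transformers version with LLaVA support).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 8B model fits on a single GPU
    device_map="auto",
)

# Llama-3-style chat prompt with an <image> placeholder (assumed template;
# check the model card for the exact format).
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\n"
    "Describe this image in detail.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Illustrative image; any PIL image works.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```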
Implementation Details
The model couples a CLIP-ViT-Large visual encoder operating at 336x336 resolution with an MLP projector that maps image features into the Llama 3 embedding space. Training runs in two stages: pretraining with both the LLM and the ViT frozen, followed by fine-tuning that updates the full LLM while applying LoRA to the ViT, balancing computational cost against performance (a conceptual sketch of this split follows the list below).
- Achieves 72.3% on MMBench Test (EN) and 66.4% on MMBench Test (CN)
- Keeps both the LLM and the ViT frozen during pretraining
- Trained on ShareGPT4V-PT (1246K samples) for pretraining and InternVL-SFT (1268K samples) for fine-tuning
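To make the "Full LLM, LoRA ViT" split concrete, the sketch below expresses the idea with the peft library. XTuner implements the actual recipe through its own training configs, so treat this as an illustration only; the LoRA hyperparameters are placeholders.

```python
# Conceptual sketch of the fine-tuning split: every LLM weight trainable,
# LoRA adapters on the otherwise frozen CLIP ViT. Hyperparameters are illustrative.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, CLIPVisionModel

# Note: the Llama 3 repository is gated and requires accepting Meta's license.
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# Full fine-tuning of the language model: all parameters remain trainable.
for p in llm.parameters():
    p.requires_grad = True

# LoRA on the ViT attention projections; peft freezes the base ViT weights.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)
vit = get_peft_model(vit, lora_cfg)
vit.print_trainable_parameters()  # only the adapter weights are trainable
```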
Core Capabilities
- Strong performance on visual question answering tasks
- Advanced image understanding and description generation
- Excellent results on AI2D Test (70.0%) and ScienceQA Test (72.9%)
- Solid bilingual (English and Chinese) performance, as reflected in the MMBench EN/CN results
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its integration of Llama 3 architecture with CLIP vision capabilities, showing significant improvements over its predecessors, particularly in benchmarks like MMBench and AI2D Test. The combination of ShareGPT4V-PT and InternVL-SFT training data provides it with robust multimodal understanding.
Q: What are the recommended use cases?
The model excels in visual question answering, image description, and multimodal understanding tasks. It's particularly suited for applications requiring detailed image analysis, scientific diagram understanding, and multilingual visual reasoning.
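For quick experiments with visual question answering, the image-to-text pipeline offers a shorter path. The sketch below reuses the same assumed Llama 3 style prompt template; the image URL and question are illustrative.

```python
# Quick VQA sketch via the transformers pipeline API.
import requests
import torch
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-to-text",
    model="xtuner/llava-llama-3-8b-v1_1-transformers",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Illustrative diagram-style image and question.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\n"
    "What does this diagram show?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0]["generated_text"])
```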