LLaVA-Llama-3-8B-v1.1 Transformers
| Property | Value |
|---|---|
| Base Model | Meta-Llama-3-8B-Instruct |
| Visual Encoder | CLIP-ViT-Large-patch14-336 |
| Fine-tuning Strategy | Full LLM, LoRA ViT |
| Repository | xtuner/llava-llama-3-8b-v1_1-transformers |
What is llava-llama-3-8b-v1_1-transformers?
llava-llama-3-8b-v1_1-transformers is a vision-language model that pairs Meta's Llama 3 8B Instruct language model with a CLIP visual encoder in the LLaVA architecture. Trained on the ShareGPT4V-PT and InternVL-SFT datasets, it is built for multimodal tasks and is particularly effective at understanding and describing images while retaining strong language capabilities.
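Because this repository ships the weights in the Hugging Face LLaVA format, the model can be loaded directly with the transformers library. The snippet below is a minimal inference sketch, assuming a recent transformers release with LLaVA support; the prompt template, image URL, and generation settings are illustrative rather than taken from the model card.

```python
# Minimal inference sketch (assumes a transformers version with LLaVA support).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 8B model fits on a single GPU
    device_map="auto",
)

# Llama-3-style chat prompt with an <image> placeholder (assumed template;
# check the model card for the exact format).
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\n"
    "Describe this image in detail.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Illustrative image; any PIL image works.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```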
Implementation Details
The model couples a CLIP-ViT-Large visual encoder operating at 336x336 resolution with an MLP projector that maps image features into the Llama 3 embedding space. Training runs in two stages: pretraining with both the LLM and the ViT frozen, followed by fine-tuning that updates the full LLM while applying LoRA to the ViT, balancing computational cost against performance (a conceptual sketch of this split follows the list below).
- Achieves 72.3% on MMBench Test (EN) and 66.4% on MMBench Test (CN)
- Keeps both the LLM and the ViT frozen during pretraining
- Trained on ShareGPT4V-PT (1246K samples) for pretraining and InternVL-SFT (1268K samples) for fine-tuning
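To make the "Full LLM, LoRA ViT" split concrete, the sketch below expresses the idea with the peft library. XTuner implements the actual recipe through its own training configs, so treat this as an illustration only; the LoRA hyperparameters are placeholders.

```python
# Conceptual sketch of the fine-tuning split: every LLM weight trainable,
# LoRA adapters on the otherwise frozen CLIP ViT. Hyperparameters are illustrative.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, CLIPVisionModel

# Note: the Llama 3 repository is gated and requires accepting Meta's license.
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# Full fine-tuning of the language model: all parameters remain trainable.
for p in llm.parameters():
    p.requires_grad = True

# LoRA on the ViT attention projections; peft freezes the base ViT weights.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)
vit = get_peft_model(vit, lora_cfg)
vit.print_trainable_parameters()  # only the adapter weights are trainable
```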
Core Capabilities
- Strong performance on visual question answering tasks
- Advanced image understanding and description generation
- Excellent results on AI2D Test (70.0%) and ScienceQA Test (72.9%)
- Solid bilingual (English and Chinese) performance, as reflected in the MMBench EN/CN results
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its integration of Llama 3 architecture with CLIP vision capabilities, showing significant improvements over its predecessors, particularly in benchmarks like MMBench and AI2D Test. The combination of ShareGPT4V-PT and InternVL-SFT training data provides it with robust multimodal understanding.
Q: What are the recommended use cases?
The model excels in visual question answering, image description, and multimodal understanding tasks. It's particularly suited for applications requiring detailed image analysis, scientific diagram understanding, and multilingual visual reasoning.
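For quick experiments with visual question answering, the image-to-text pipeline offers a shorter path. The sketch below reuses the same assumed Llama 3 style prompt template; the image URL and question are illustrative.

```python
# Quick VQA sketch via the transformers pipeline API.
import requests
import torch
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-to-text",
    model="xtuner/llava-llama-3-8b-v1_1-transformers",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Illustrative diagram-style image and question.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\n"
    "What does this diagram show?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0]["generated_text"])
```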