llava-llama-3-8b-v1_1

Developed by xtuner

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct with a CLIP-ViT-Large visual encoder, optimized for image-text tasks. 8.03B parameters, with strong MMBench performance.

| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Image-Text-to-Text |
| Architecture | LLaVA with CLIP-ViT-Large |
| Tensor Type | FP16 |

What is llava-llama-3-8b-v1_1?

llava-llama-3-8b-v1_1 is a multimodal model that pairs Meta's Llama-3-8B-Instruct language model with a CLIP-ViT-Large visual encoder. It is fine-tuned on the ShareGPT4V-PT and InternVL-SFT datasets and is designed to handle complex image-text interactions.

Implementation Details

The architecture combines a CLIP-ViT-Large visual encoder with an MLP projector, operating at an input resolution of 336×336. Training proceeds in two stages: during pretraining, both the LLM and the ViT are frozen and only the projector is trained; during fine-tuning, the full LLM is trained alongside a LoRA-adapted ViT.

  • Visual Encoder: CLIP-ViT-Large-patch14-336
  • Base Model: meta-llama/Meta-Llama-3-8B-Instruct
  • Training Strategy: Full LLM with LoRA ViT fine-tuning
  • Dataset Size: 1246K pretraining + 1268K fine-tuning samples
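The projector's role above can be made concrete with a minimal sketch. The dimensions are standard for these components (1024 for CLIP-ViT-Large, 4096 for Llama-3-8B, 14-pixel patches), but the weights and the two-layer GELU MLP here are illustrative stand-ins, not the released checkpoint:

```python
import numpy as np

CLIP_DIM = 1024                  # CLIP-ViT-Large hidden size
LLM_DIM = 4096                   # Llama-3-8B hidden size
NUM_PATCHES = (336 // 14) ** 2   # 576 patches at 336x336 with 14-px patches

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two-layer MLP mapping visual patch features into the LLM embedding space
    (an illustrative sketch with random weights)."""
    def __init__(self, rng):
        self.w1 = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02
        self.w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

    def __call__(self, feats):
        return gelu(feats @ self.w1) @ self.w2

rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((NUM_PATCHES, CLIP_DIM))  # CLIP output
tokens = MLPProjector(rng)(patch_feats)                     # visual "tokens"
print(tokens.shape)  # (576, 4096)
```

Each of the 576 projected rows is then consumed by the LLM as if it were an ordinary token embedding, which is what lets the frozen language model attend to visual content.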

Core Capabilities

  • 72.3% accuracy on MMBench Test (EN)
  • 66.4% accuracy on MMBench Test (CN)
  • 70.0% accuracy on AI2D Test
  • Robust performance across multiple vision-language tasks
  • Enhanced multilingual capabilities

Frequently Asked Questions

Q: What makes this model unique?

This model improves on the earlier v1.0 release, particularly on the MMBench and AI2D benchmarks. It leverages a combination of the ShareGPT4V-PT and InternVL-SFT datasets, resulting in better cross-modal understanding.

Q: What are the recommended use cases?

The model excels in vision-language tasks including visual question answering, image understanding, and multilingual image-text interactions. It's particularly suitable for applications requiring detailed image analysis and natural language responses.
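For visual question answering, the prompt interleaves an image placeholder with the user's question. The sketch below assumes a Llama-3 instruct chat layout and an `<image>` placeholder token; in practice the processor and tokenizer shipped with the model define the canonical template, so treat this layout as illustrative:

```python
def build_prompt(question: str) -> str:
    """Assemble a Llama-3-style chat prompt with a LLaVA image placeholder.

    Assumed layout: the image token precedes the question inside the user
    turn, and the prompt ends with an open assistant header so the model
    generates the answer next.
    """
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt("What objects are visible in this image?")
print(prompt)
```

At inference time the placeholder is replaced by the projected visual tokens, so the question and the image are processed in a single sequence.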
