llava-llama-3-8b-v1_1-transformers

xtuner

LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct with CLIP-ViT integration. Excels in vision-language tasks with strong performance on MMBench and other benchmarks.

Base Model: Meta-Llama-3-8B-Instruct
Visual Encoder: CLIP-ViT-Large-patch14-336
Training Strategy: Full LLM, LoRA ViT
Repository: xtuner/llava-llama-3-8b-v1_1-transformers

What is llava-llama-3-8b-v1_1-transformers?

This is an advanced vision-language model that combines Meta's Llama 3 architecture with visual capabilities through CLIP integration. It's specifically designed for multimodal tasks, trained on ShareGPT4V-PT and InternVL-SFT datasets, making it particularly effective at understanding and describing images while maintaining strong language capabilities.

Implementation Details

The model combines a CLIP-L visual encoder with an MLP projector, operating at 336x336 input resolution. During fine-tuning, the LLM receives full-parameter updates while the Vision Transformer is adapted with LoRA, balancing computational efficiency against performance.

  • Achieves 72.3% on MMBench Test (EN) and 66.4% on MMBench Test (CN)
  • Keeps both the LLM and the ViT frozen during the pretraining stage
  • Utilizes ShareGPT4V-PT (1246K) and InternVL-SFT (1268K) datasets
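The projector described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual implementation: the two-layer MLP shape and the hidden sizes (1024 for CLIP-ViT-Large, 4096 for Llama-3-8B) are standard values assumed here for clarity.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Sketch of an MLP projector mapping CLIP patch features
    into the LLM embedding space (dimensions are assumptions)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)

# A 336x336 image with patch size 14 yields (336 / 14) ** 2 = 576 patch tokens.
projector = MLPProjector()
patches = torch.randn(1, 576, 1024)   # dummy CLIP output: (batch, tokens, dim)
visual_tokens = projector(patches)
print(visual_tokens.shape)            # torch.Size([1, 576, 4096])
```

The projected visual tokens are then concatenated with the text embeddings so the LLM attends over both modalities in a single sequence.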

Core Capabilities

  • Strong performance on visual question answering tasks
  • Advanced image understanding and description generation
  • Excellent results on AI2D Test (70.0%) and ScienceQA Test (72.9%)
  • Robust performance on multilingual tasks

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its integration of Llama 3 architecture with CLIP vision capabilities, showing significant improvements over its predecessors, particularly in benchmarks like MMBench and AI2D Test. The combination of ShareGPT4V-PT and InternVL-SFT training data provides it with robust multimodal understanding.

Q: What are the recommended use cases?

The model excels in visual question answering, image description, and multimodal understanding tasks. It's particularly suited for applications requiring detailed image analysis, scientific diagram understanding, and multilingual visual reasoning.
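Since the checkpoint is published in the transformers format, a typical way to try it is through the `image-to-text` pipeline. The sketch below builds a prompt in the Llama-3 instruct chat format; the exact template string is an assumption based on the standard Llama-3 special tokens, and the image filename is a placeholder.

```python
def build_llama3_vqa_prompt(question: str) -> str:
    """Build a prompt in the Llama-3 chat format this model expects.
    The template is an assumption based on the Llama-3 instruct format;
    <image> marks where the visual tokens are spliced in."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_vqa_prompt("What is shown in this image?")
print(prompt)

# Loading the 8B model downloads roughly 16 GB of weights; uncomment to run:
# from transformers import pipeline
# pipe = pipeline("image-to-text",
#                 model="xtuner/llava-llama-3-8b-v1_1-transformers")
# out = pipe("photo.jpg", prompt=prompt,
#            generate_kwargs={"max_new_tokens": 200})
# print(out[0]["generated_text"])
```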
