Llama-3.2V-11B-cot

Maintained By: Xkev


  • Parameter Count: 10.7B
  • License: Apache-2.0
  • Base Model: meta-llama/Llama-3.2-11B-Vision-Instruct
  • Paper: LLaVA-o1: Let Vision Language Models Reason Step-by-Step
  • Average Benchmark Score: 63.5%

What is Llama-3.2V-11B-cot?

Llama-3.2V-11B-cot is the first iteration of LLaVA-o1, a visual language model designed for systematic reasoning over images. Built on meta-llama/Llama-3.2-11B-Vision-Instruct, it works through visual questions in explicit stages, producing step-by-step reasoning before committing to a final answer rather than responding directly.

Implementation Details

The model is implemented with the transformers library and stores its weights in F32 precision. Training used FSDP (Fully Sharded Data Parallel) with mixed precision, a learning rate of 1e-5, 3 epochs, and a context length of 4096 tokens. At inference time, the model wraps its final answer in <CONCLUSION> tags.
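As a rough illustration of these training settings, the sketch below maps the stated hyperparameters onto HuggingFace TrainingArguments; the actual LLaVA-o1 training script is not documented in this card, so the batch size, output path, and other unstated values here are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
# Only the learning rate, epochs, mixed precision, and FSDP come from the card;
# everything else is a placeholder. Launch with torchrun/accelerate for FSDP.
training_args = TrainingArguments(
    output_dir="llama-3.2v-11b-cot",      # placeholder output path
    learning_rate=1e-5,                   # stated learning rate
    num_train_epochs=3,                   # stated number of epochs
    bf16=True,                            # mixed-precision training (assumed bf16)
    fsdp="full_shard auto_wrap",          # Fully Sharded Data Parallel
    per_device_train_batch_size=1,        # assumption; not specified in the card
    gradient_accumulation_steps=8,        # assumption; not specified in the card
)
```

The 4096-token context length is enforced by the data pipeline or trainer (for example, a maximum sequence length setting) rather than by TrainingArguments itself.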

  • Trained on the LLaVA-o1-100k dataset
  • Uses a padded batching strategy
  • Supports generation of up to 2048 new tokens
  • Uses a temperature of 0.6 and top_p of 0.9 for inference (see the inference sketch after this list)
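The following is a minimal inference sketch using the standard transformers Mllama classes inherited from the Llama-3.2-Vision base; the repository id, image path, and prompt are assumptions, and the generation parameters mirror the values listed above.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Xkev/Llama-3.2V-11B-cot"  # assumed Hugging Face repo id

# Load in bf16 to reduce memory (weights are stored in F32); shard across GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and question -- replace with your own inputs.
image = Image.open("example.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this diagram show? Explain step by step."},
    ]}
]

# Build the chat prompt and generate with the parameters noted above.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=2048,  # maximum new tokens
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(processor.decode(output[0], skip_special_tokens=True))
```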

Core Capabilities

  • Visual-language understanding and reasoning
  • Strong performance on AI2D (85.7%) and MMBench (75.0%)
  • Systematic step-by-step analysis of visual inputs
  • Balanced performance across multiple benchmarks including MMVet and MathVista

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its systematic reasoning capabilities in visual language tasks, implementing a step-by-step approach that enhances its analytical abilities compared to traditional vision-language models.
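Because the final answer is wrapped in <CONCLUSION> tags (see Implementation Details), a small post-processing helper, sketched below, can separate the step-by-step reasoning from the answer itself; the helper name and the fallback behaviour are choices made here, not part of the model.

```python
import re

def extract_conclusion(generated_text: str) -> str:
    """Return the text inside the <CONCLUSION>...</CONCLUSION> block, if present.

    Falls back to the full generation when no tags are found, since the
    structured format is learned behaviour and not guaranteed for every prompt.
    """
    match = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else generated_text.strip()

# Example: feed in the decoded output from the inference sketch above.
print(extract_conclusion("<CONCLUSION> The chart shows rainfall by month. </CONCLUSION>"))
```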

Q: What are the recommended use cases?

The model is particularly well-suited for tasks requiring detailed visual analysis, including educational applications (given its strong AI2D performance), technical documentation analysis, and general visual reasoning tasks where step-by-step deduction is valuable.
