# Llama-3.2V-11B-cot
| Property | Value |
|---|---|
| Parameter Count | 10.7B |
| License | Apache-2.0 |
| Base Model | meta-llama/Llama-3.2-11B-Vision-Instruct |
| Paper | LLaVA-o1: Let Vision Language Models Reason Step-by-Step |
| Average Benchmark Score | 63.5% |
## What is Llama-3.2V-11B-cot?

Llama-3.2V-11B-cot is the first iteration of LLaVA-o1, a visual language model designed for systematic reasoning tasks. Built on meta-llama/Llama-3.2-11B-Vision-Instruct, it processes and analyzes visual information through an explicit step-by-step reasoning procedure rather than jumping directly to an answer.
## Implementation Details

The model is implemented with the `transformers` library and ships its weights in F32 precision. It was trained with FSDP (Fully Sharded Data Parallel) under mixed precision, using a learning rate of 1e-5, 3 epochs, and a context length of 4096 tokens. At inference time, the model produces structured output and wraps its final answer in `<CONCLUSION>` tags. Key details (illustrative sketches follow the list below):

- Trained on the custom LLaVA-o1-100k dataset
- Batches inputs with padding
- Generates up to 2048 new tokens
- Samples with temperature 0.6 and top_p 0.9 at inference
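The sketches below are illustrative, not the authors' released scripts. The first mirrors the reported training hyperparameters in a Hugging Face `TrainingArguments` object; the batch size and output directory are placeholders, not values from the model card.

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the reported hyperparameters.
training_args = TrainingArguments(
    output_dir="llava-o1-checkpoints",  # placeholder path
    learning_rate=1e-5,                 # reported learning rate
    num_train_epochs=3,                 # reported epoch count
    per_device_train_batch_size=1,      # placeholder, not reported
    bf16=True,                          # mixed-precision training
    fsdp="full_shard",                  # Fully Sharded Data Parallel
)
```

The second sketch shows one way to run inference with the documented sampling settings, assuming the model follows the standard Mllama chat API of its Llama-3.2-Vision base; the repo id and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "Xkev/Llama-3.2V-11B-cot"  # assumed repo id; substitute the real one

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("diagram.png")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many gears are visible in this diagram?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,  # documented generation cap
    do_sample=True,
    temperature=0.6,      # documented sampling settings
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```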
## Core Capabilities
- Visual-language understanding and reasoning
- Strong performance on AI2D (85.7%) and MMBench (75.0%)
- Systematic step-by-step analysis of visual inputs, with the final answer delimited by `<CONCLUSION>` tags (see the parsing sketch after this list)
- Balanced performance across multiple benchmarks including MMVet and MathVista
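Because the final answer is delimited by `<CONCLUSION>` tags, downstream code can separate it from the intermediate reasoning. A minimal parsing sketch, assuming the tag format described in the implementation details above:

```python
import re

def extract_conclusion(generated_text: str) -> str | None:
    """Return the text inside <CONCLUSION>...</CONCLUSION>, if present.

    Assumes the tag format noted in the implementation details;
    returns None when no conclusion block is found.
    """
    match = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>",
                      generated_text, re.DOTALL)
    return match.group(1).strip() if match else None
```

Keeping the reasoning trace and the extracted answer separate makes it straightforward to log the former while scoring only the latter, e.g. in a benchmark harness.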
## Frequently Asked Questions

**Q: What makes this model unique?**
This model stands out for its systematic reasoning on visual language tasks: rather than answering in one shot, it works through the problem step by step, which improves its analytical reliability compared to traditional vision-language models.
**Q: What are the recommended use cases?**
The model is particularly well-suited for tasks requiring detailed visual analysis, including educational applications (given its strong AI2D performance), technical documentation analysis, and general visual reasoning tasks where step-by-step deduction is valuable.