# Llama-3.2V-11B-cot
| Property | Value |
|---|---|
| Parameter Count | 10.7B |
| License | Apache-2.0 |
| Base Model | meta-llama/Llama-3.2-11B-Vision-Instruct |
| Paper | LLaVA-o1: Let Vision Language Models Reason Step-by-Step |
| Average Benchmark Score | 63.5% |
## What is Llama-3.2V-11B-cot?

Llama-3.2V-11B-cot is the first iteration of LLaVA-o1, a visual language model designed for systematic reasoning tasks. Built on meta-llama/Llama-3.2-11B-Vision-Instruct, it processes and analyzes visual information through an explicit step-by-step reasoning procedure rather than jumping directly to an answer.
## Implementation Details

The model is implemented with the `transformers` library and ships its weights in F32 precision. It was trained with FSDP (Fully Sharded Data Parallel) under mixed precision, using a learning rate of 1e-5, 3 epochs, and a context length of 4096 tokens. At inference time, the model produces structured output and wraps its final answer in `<CONCLUSION>` tags. Key details (illustrative sketches follow the list below):

- Trained on the custom LLaVA-o1-100k dataset
- Batches inputs with padding
- Generates up to 2048 new tokens
- Samples with temperature 0.6 and top_p 0.9 at inference
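The sketches below are illustrative, not the authors' released scripts. The first mirrors the reported training hyperparameters in a Hugging Face `TrainingArguments` object; the batch size and output directory are placeholders, not values from the model card.

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the reported hyperparameters.
training_args = TrainingArguments(
    output_dir="llava-o1-checkpoints",  # placeholder path
    learning_rate=1e-5,                 # reported learning rate
    num_train_epochs=3,                 # reported epoch count
    per_device_train_batch_size=1,      # placeholder, not reported
    bf16=True,                          # mixed-precision training
    fsdp="full_shard",                  # Fully Sharded Data Parallel
)
```

The second sketch shows one way to run inference with the documented sampling settings, assuming the model follows the standard Mllama chat API of its Llama-3.2-Vision base; the repo id and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "Xkev/Llama-3.2V-11B-cot"  # assumed repo id; substitute the real one

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("diagram.png")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many gears are visible in this diagram?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,  # documented generation cap
    do_sample=True,
    temperature=0.6,      # documented sampling settings
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```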
## Core Capabilities
- Visual-language understanding and reasoning
- Strong performance on AI2D (85.7%) and MMBench (75.0%)
- Systematic step-by-step analysis of visual inputs, with the final answer delimited by `<CONCLUSION>` tags (see the parsing sketch after this list)
- Balanced performance across multiple benchmarks including MMVet and MathVista
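Because the final answer is delimited by `<CONCLUSION>` tags, downstream code can separate it from the intermediate reasoning. A minimal parsing sketch, assuming the tag format described in the implementation details above:

```python
import re

def extract_conclusion(generated_text: str) -> str | None:
    """Return the text inside <CONCLUSION>...</CONCLUSION>, if present.

    Assumes the tag format noted in the implementation details;
    returns None when no conclusion block is found.
    """
    match = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>",
                      generated_text, re.DOTALL)
    return match.group(1).strip() if match else None
```

Keeping the reasoning trace and the extracted answer separate makes it straightforward to log the former while scoring only the latter, e.g. in a benchmark harness.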
## Frequently Asked Questions

**Q: What makes this model unique?**
This model stands out for its systematic reasoning on visual language tasks: rather than answering in one shot, it works through the problem step by step, which improves its analytical reliability compared to traditional vision-language models.
**Q: What are the recommended use cases?**
The model is particularly well-suited for tasks requiring detailed visual analysis, including educational applications (given its strong AI2D performance), technical documentation analysis, and general visual reasoning tasks where step-by-step deduction is valuable.