granite-vision-3.2-2b

Maintained by: ibm-granite

  • Author: IBM Granite
  • Release Date: February 26th, 2025
  • License: Apache 2.0
  • Parameters: 2 billion
  • Framework: Transformers (>= 4.49)

What is granite-vision-3.2-2b?

granite-vision-3.2-2b is a cutting-edge vision-language model specifically designed for enterprise document understanding tasks. Built by IBM, this model excels at extracting content from tables, charts, infographics, and various document formats. It achieves state-of-the-art performance across multiple document understanding benchmarks, notably scoring 0.89 on DocVQA and 0.87 on ChartQA.

Implementation Details

The model architecture combines three key components: a SigLIP vision encoder, a two-layer MLP vision-language connector with GELU activation, and the granite-3.1-2b-instruct language model featuring 128k context length. The implementation builds upon LLaVA's foundation, incorporating multi-layer encoder features and enhanced grid resolution for superior document comprehension.
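As a rough illustration of the projector design, the snippet below is a minimal PyTorch sketch of a two-layer MLP vision-language connector with GELU activation, as described above. The hidden sizes and patch count are placeholder assumptions for illustration, not the model's actual dimensions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative two-layer MLP projector with GELU, mapping vision encoder
    features into the language model's embedding space. Dimensions are assumed."""

    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),  # first projection layer
            nn.GELU(),                        # GELU activation, per the model card
            nn.Linear(text_dim, text_dim),    # second projection layer
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(vision_features)

# Example: project a dummy batch of patch embeddings
connector = VisionLanguageConnector()
dummy = torch.randn(1, 729, 1152)          # assumed patch count and feature size
print(connector(dummy).shape)               # torch.Size([1, 729, 2048])
```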

  • Trained on IBM's Blue Vela supercomputing cluster with H100 GPUs
  • Supports both transformers and vLLM deployment options (see the loading sketch after this list)
  • Includes built-in safety features and compatibility with Granite Guardian
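
For the transformers path, loading and querying the model looks roughly like the sketch below. It assumes the Hugging Face repo id ibm-granite/granite-vision-3.2-2b, a local image file, and the generic AutoProcessor/AutoModelForVision2Seq classes; consult the official model card for the exact prompt format.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.2-2b"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Ask a question about a local document image (e.g. a scanned invoice)
image = Image.open("invoice.png")               # placeholder file name
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the invoice total?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```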

Core Capabilities

  • Document visual question answering (DocVQA: 0.89)
  • Chart and graph interpretation (ChartQA: 0.87)
  • Text extraction and analysis (TextVQA: 0.78)
  • General visual understanding (VQAv2: 0.78)
  • Real-world image analysis (RealWorldQA: 0.63)

Frequently Asked Questions

Q: What makes this model unique?

The model's specialization in document understanding, combined with its compact 2B parameter size, makes it particularly efficient for enterprise applications. It outperforms comparable models in document-related tasks while maintaining strong general vision capabilities.

Q: What are the recommended use cases?

The model is ideal for enterprise applications involving document processing, including analyzing tables and charts, performing OCR, and answering questions about document content. For text-only tasks, it's recommended to use the dedicated Granite language models instead.
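
Since vLLM deployment is also supported, a batch-style document QA setup might look like the following sketch. The vLLM multimodal pattern shown here (a prompt plus multi_modal_data) is the library's general vision-language interface; the prompt template and repo id are assumptions, so check the model card and a Granite-Vision-capable vLLM release before relying on them.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; requires a vLLM version with Granite Vision support
llm = LLM(model="ibm-granite/granite-vision-3.2-2b", max_model_len=8192)

image = Image.open("chart.png")  # placeholder chart image
# Placeholder prompt format -- take the exact chat template from the model card
prompt = "<|user|>\n<image>\nWhat trend does this chart show?\n<|assistant|>\n"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```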
