granite-vision-3.2-2b

Maintained by: ibm-granite

  • Author: IBM Granite
  • Release Date: February 26th, 2025
  • License: Apache 2.0
  • Parameters: 2 billion
  • Framework: Transformers (>= 4.49)

What is granite-vision-3.2-2b?

granite-vision-3.2-2b is a cutting-edge vision-language model specifically designed for enterprise document understanding tasks. Built by IBM, this model excels at extracting content from tables, charts, infographics, and various document formats. It achieves state-of-the-art performance across multiple document understanding benchmarks, notably scoring 0.89 on DocVQA and 0.87 on ChartQA.

Implementation Details

The model architecture combines three key components: a SigLIP vision encoder, a two-layer MLP vision-language connector with GELU activation, and the granite-3.1-2b-instruct language model featuring 128k context length. The implementation builds upon LLaVA's foundation, incorporating multi-layer encoder features and enhanced grid resolution for superior document comprehension.
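As a rough illustration of the projector design, the snippet below is a minimal PyTorch sketch of a two-layer MLP vision-language connector with GELU activation, as described above. The hidden sizes and patch count are placeholder assumptions for illustration, not the model's actual dimensions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative two-layer MLP projector with GELU, mapping vision encoder
    features into the language model's embedding space. Dimensions are assumed."""

    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),  # first projection layer
            nn.GELU(),                        # GELU activation, per the model card
            nn.Linear(text_dim, text_dim),    # second projection layer
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(vision_features)

# Example: project a dummy batch of patch embeddings
connector = VisionLanguageConnector()
dummy = torch.randn(1, 729, 1152)          # assumed patch count and feature size
print(connector(dummy).shape)               # torch.Size([1, 729, 2048])
```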

  • Trained on IBM's Blue Vela supercomputing cluster with H100 GPUs
  • Supports both transformers and vLLM deployment options (see the loading sketch after this list)
  • Includes built-in safety features and compatibility with Granite Guardian
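
For the transformers path, loading and querying the model looks roughly like the sketch below. It assumes the Hugging Face repo id ibm-granite/granite-vision-3.2-2b, a local image file, and the generic AutoProcessor/AutoModelForVision2Seq classes; consult the official model card for the exact prompt format.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.2-2b"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Ask a question about a local document image (e.g. a scanned invoice)
image = Image.open("invoice.png")               # placeholder file name
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the invoice total?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```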

Core Capabilities

  • Document visual question answering (DocVQA: 0.89)
  • Chart and graph interpretation (ChartQA: 0.87)
  • Text extraction and analysis (TextVQA: 0.78)
  • General visual understanding (VQAv2: 0.78)
  • Real-world image analysis (RealWorldQA: 0.63)

Frequently Asked Questions

Q: What makes this model unique?

The model's specialization in document understanding, combined with its compact 2B parameter size, makes it particularly efficient for enterprise applications. It outperforms comparable models in document-related tasks while maintaining strong general vision capabilities.

Q: What are the recommended use cases?

The model is ideal for enterprise applications involving document processing, including analyzing tables and charts, performing OCR, and answering questions about document content. For text-only tasks, it's recommended to use the dedicated Granite language models instead.
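
Since vLLM deployment is also supported, a batch-style document QA setup might look like the following sketch. The vLLM multimodal pattern shown here (a prompt plus multi_modal_data) is the library's general vision-language interface; the prompt template and repo id are assumptions, so check the model card and a Granite-Vision-capable vLLM release before relying on them.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; requires a vLLM version with Granite Vision support
llm = LLM(model="ibm-granite/granite-vision-3.2-2b", max_model_len=8192)

image = Image.open("chart.png")  # placeholder chart image
# Placeholder prompt format -- take the exact chat template from the model card
prompt = "<|user|>\n<image>\nWhat trend does this chart show?\n<|assistant|>\n"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```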
