granite-vision-3.1-2b-preview

ibm-granite

Compact 3B-parameter vision-language model optimized for document understanding, featuring strong performance on chart/table analysis and general VQA tasks. Built on Granite LLM.

Property	Value
Parameter Count	3 billion
License	Apache 2.0
Release Date	Jan 31st, 2025
Languages	English

What is granite-vision-3.1-2b-preview?

Granite Vision 3.1 is a sophisticated vision-language model specifically engineered for enterprise document understanding and analysis. Built on IBM's Granite LLM architecture, this model excels at processing visual documents, including tables, charts, infographics, and diagrams. With 3 billion parameters, it achieves state-of-the-art performance across various document understanding benchmarks while maintaining computational efficiency.

Implementation Details

The model architecture combines three key components: a SigLIP vision encoder, a two-layer MLP vision-language connector with GELU activation, and the granite-3.1-2b-instruct language model featuring 128k context length. The implementation builds upon LLaVA architecture with enhanced multi-layer encoder features and denser grid resolution for improved document interpretation.

Trained on IBM's Blue Vela supercomputing cluster with H100 GPUs
Leverages both public datasets and synthetic data for document understanding
Implements advanced safety features through Granite Guardian integration

Core Capabilities

Superior performance on ChartQA (0.86) and DocVQA (0.88) benchmarks
Enhanced OCR capabilities with 0.75 score on OCRBench
Strong general VQA performance (0.81 on VQAv2)
Specialized in table and chart analysis
Document content extraction and comprehension

Frequently Asked Questions

Q: What makes this model unique?

The model's specialized focus on document understanding, combined with its compact size and superior benchmark performance, makes it particularly valuable for enterprise applications. Its integration with Granite Guardian for safety monitoring sets it apart from similar models.

Q: What are the recommended use cases?

The model is ideal for enterprise applications involving document analysis, including table extraction, chart interpretation, and OCR tasks. It's particularly well-suited for automated content extraction from business documents, technical diagrams, and financial reports.