InternViT-6B-448px-V1-5

OpenGVLab

A 5.54B-parameter vision foundation model trained on multiple datasets. It supports dynamic resolution built from 448×448 tiles, with enhanced OCR capabilities and high-resolution processing.

Property	Value
Parameter Count	5.54B
License	MIT
Paper	arXiv:2404.16821
Model Type	Vision Foundation Model
Tensor Type	BF16

What is InternViT-6B-448px-V1-5?

InternViT-6B-448px-V1-5 is a vision foundation model that builds on its predecessor, InternViT-6B-448px-V1-2. It adds dynamic resolution handling and stronger OCR capabilities. The architecture uses 45 transformer blocks instead of the 48 in the original design, which reduces the parameter count from 5.9B to 5.54B while maintaining performance.
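To put the parameter counts in practical terms, the following minimal sketch estimates the storage needed for the weights alone, assuming every parameter is stored in BF16 (2 bytes each, as listed in the property table). The helper names `weight_bytes` and `to_gib` are illustrative, and the estimate ignores activations, buffers, and framework overhead.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def weight_bytes(num_params, bytes_per_param=BYTES_PER_BF16):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bytes_per_param


def to_gib(num_bytes):
    """Convert bytes to binary gigabytes (GiB)."""
    return num_bytes / 2**30


current = weight_bytes(5_540_000_000)   # 45-block model (5.54B parameters)
original = weight_bytes(5_900_000_000)  # 48-block figure quoted above (5.9B)

print(f"45 blocks: {current / 1e9:.2f} GB ({to_gib(current):.2f} GiB)")
print(f"48 blocks: {original / 1e9:.2f} GB ({to_gib(original):.2f} GiB)")
print(f"saved:     {(original - current) / 1e9:.2f} GB")
```

Under these assumptions the weights need roughly 11 GB, so the removed blocks save a bit under 1 GB of weight storage.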

Implementation Details

The model processes images as 448×448-pixel tiles and supports dynamic resolution with 1 to 12 tiles per image, so inputs of different sizes and aspect ratios can be handled flexibly. It is trained on a diverse set of datasets, including LAION-en, LAION-zh, COYO, and various OCR-specific datasets. The weights are stored in BF16 precision, and the model is optimized for MLLM (Multimodal Large Language Model) applications.

Optimized architecture with 45 transformer blocks
Dynamic resolution support with 1-12 tiles
Enhanced OCR capabilities through specialized training
Weights stored in the BF16 tensor format
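The idea of covering an image with 1 to 12 tiles of 448×448 pixels can be sketched as follows. This is a simplified illustration, not the model's confirmed preprocessing: the selection rule (pick the grid whose aspect ratio is closest to the image's, preferring fewer tiles on ties) and the function names `candidate_grids`, `choose_grid`, and `tile_boxes` are assumptions made for this example.

```python
TILE = 448  # tile side length in pixels, from the model description


def candidate_grids(min_tiles=1, max_tiles=12):
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    # Deterministic order: fewest tiles first, so ties favour smaller grids.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0], g[1]))


def choose_grid(width, height, max_tiles=12):
    """ASSUMED rule: pick the grid whose aspect ratio best matches the image."""
    target = width / height
    return min(
        candidate_grids(1, max_tiles),
        key=lambda g: abs(g[0] / g[1] - target),
    )


def tile_boxes(cols, rows, tile=TILE):
    """Crop boxes (left, top, right, bottom) for an image that has been
    resized to (cols * tile) x (rows * tile) pixels, in row-major order."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(1344, 896)  # a 3:2 landscape image
boxes = tile_boxes(cols, rows)
print(f"grid: {cols}x{rows}, tiles: {len(boxes)}")
```

In practice the image would be resized to the chosen grid size and each box cropped into its own 448×448 input. Preferring fewer tiles on ties keeps the sketch short; a real pipeline may also weigh the image's area when choosing between grids with the same aspect ratio.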

Core Capabilities

High-resolution image processing
Robust OCR functionality for both English and Chinese text
Feature extraction for downstream tasks
Efficient memory utilization through architectural optimization

Frequently Asked Questions

Q: What makes this model unique?

The model combines dynamic resolution handling (1 to 12 tiles of 448×448 pixels), OCR-focused training, and a reduced 45-block architecture. This combination makes it well suited to complex vision tasks while keeping the parameter count at 5.54B.

Q: What are the recommended use cases?

This model excels in image feature extraction, OCR tasks, and as a backbone for larger multi-modal systems. It's particularly well-suited for applications requiring high-resolution image understanding and text recognition capabilities.
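When the model serves as a backbone, the number of patch features it produces grows with the number of tiles. The sketch below estimates that count. It assumes a ViT patch size of 14 pixels, which this document does not state, and the function names are illustrative; any class token or token compression applied by a downstream system is ignored.

```python
TILE = 448   # tile side in pixels, from the model description
PATCH = 14   # ASSUMPTION: ViT patch size, not stated in this document


def patch_tokens_per_tile(tile=TILE, patch=PATCH):
    """Number of non-overlapping patches in one square tile."""
    if tile % patch:
        raise ValueError("tile side must be a multiple of the patch size")
    side = tile // patch
    return side * side


def patch_tokens(num_tiles, tile=TILE, patch=PATCH):
    """Total patch tokens across all tiles of one image."""
    if not 1 <= num_tiles <= 12:
        raise ValueError("the model description covers 1 to 12 tiles")
    return num_tiles * patch_tokens_per_tile(tile, patch)


for n in (1, 6, 12):
    print(f"{n:2d} tile(s): {patch_tokens(n)} patch tokens")
```

Under the stated patch-size assumption, each tile yields a 32×32 grid of patches, so the cost of high-resolution inputs scales linearly with the tile count.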