InternViT-300M-448px

OpenGVLab

A lightweight 304M-parameter vision foundation model for image feature extraction. It processes images as 448x448 tiles and supports dynamic resolution through multi-tile processing.

Property	Value
Parameter Count	304M
License	MIT
Image Size	448x448
Tensor Type	BF16

What is InternViT-300M-448px?

InternViT-300M-448px is a compact vision foundation model developed by OpenGVLab and distilled from the larger InternViT-6B-448px-V1-5. Distillation brings the parameter count down to 304M while aiming to retain the teacher model's capabilities, which lowers the memory and compute needed to run it.
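The 304M parameter count translates fairly directly into weight memory. The following is a rough back-of-the-envelope sketch, not a measurement: it counts only the weights (no activations, no framework overhead), and the helper name `weight_memory_mib` is hypothetical.

```python
# Bytes per element for a few common tensor dtypes.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params: int, dtype: str = "bf16") -> float:
    """Approximate memory needed to hold the weights alone, in MiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)


if __name__ == "__main__":
    params = 304_000_000  # parameter count stated in this model card
    for dtype in ("fp32", "bf16"):
        print(f"{dtype}: ~{weight_memory_mib(params, dtype):.0f} MiB")
```

Storing the weights in BF16 halves the footprint relative to FP32, since each parameter takes 2 bytes instead of 4.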

Implementation Details

The model takes 448×448-pixel tiles as input and handles larger or non-square images by splitting them into multiple tiles: 1 to 12 tiles during training and 1 to 40 tiles during inference. It is built on a transformer architecture and uses BF16 precision, which reduces memory use and compute cost compared with full-precision weights.

Pretrained on multiple datasets including LAION-en, LAION-zh, COYO, and specialized OCR datasets
Supports dynamic multi-tile processing for handling various image sizes
Produced by knowledge distillation from InternViT-6B-448px-V1-5
Optimized for bfloat16 precision
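One way to picture dynamic multi-tile processing is as a search for the tile grid that best matches the image. The sketch below is an illustrative assumption, not the model's actual preprocessing code: the function name `choose_tile_grid` and the tie-breaking rule (prefer the grid whose pixel area is closest to the image's) are hypothetical choices. Only the 448-pixel tile size and the 1-12 and 1-40 tile limits come from the text above.

```python
def choose_tile_grid(width, height, min_tiles=1, max_tiles=12, tile_size=448):
    """Return (cols, rows) for the grid whose aspect ratio best matches the image.

    Hypothetical helper: ties on aspect ratio are broken by picking the grid
    whose total pixel area is closest to the image's own area.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    image_ratio = width / height
    image_area = width * height

    # Every grid whose tile count falls within the allowed range.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]

    def score(grid):
        cols, rows = grid
        ratio_diff = abs(image_ratio - cols / rows)
        area_diff = abs(image_area - cols * rows * tile_size ** 2)
        return (ratio_diff, area_diff)

    return min(candidates, key=score)


if __name__ == "__main__":
    # Training-style limit (up to 12 tiles) versus inference-style limit (up to 40).
    print(choose_tile_grid(1920, 1080, max_tiles=12))
    print(choose_tile_grid(1920, 1080, max_tiles=40))
```

Raising `max_tiles` gives the search more grids to choose from, so a wide image can be matched more closely at inference time than under the training limit.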

Core Capabilities

High-quality image feature extraction
Robust OCR capabilities inherited from the parent model
Efficient processing of high-resolution images
Flexible tile-based processing system
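To make the tile-based processing concrete, the sketch below computes the crop box for each tile once an image has been resized to fit a grid of 448×448 tiles. It is a minimal illustration under stated assumptions: the helper names are hypothetical, the boxes follow the (left, upper, right, lower) convention used by Pillow's `Image.crop`, and the patch size of 14 is an assumption that should be checked against the model's configuration.

```python
def tile_boxes(cols, rows, tile_size=448):
    """Return (left, upper, right, lower) crop boxes in row-major order.

    Assumes the image was already resized to (cols * tile_size) by
    (rows * tile_size) pixels.
    """
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


def patch_tokens_per_tile(tile_size=448, patch_size=14):
    """Number of patch tokens a ViT produces for one square tile.

    patch_size=14 is an assumption, not a value stated in this model card.
    """
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be divisible by patch_size")
    return (tile_size // patch_size) ** 2


if __name__ == "__main__":
    boxes = tile_boxes(cols=2, rows=1)
    print(boxes)
    print(len(boxes) * patch_tokens_per_tile(), "patch tokens across all tiles")
```

Because each tile is encoded independently, the number of feature tokens grows linearly with the tile count, which is the main cost to weigh when raising the tile limit.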

Frequently Asked Questions

Q: What makes this model unique?

The model pairs a relatively small 304M parameter count with capabilities distilled from a much larger model. Its support for dynamic resolution through multi-tile processing makes it applicable to a range of vision tasks.

Q: What are the recommended use cases?

The model is suited to image feature extraction, OCR applications, and high-resolution image processing. It is a particularly good fit where computational resources are limited but high-quality visual features are still needed.