Vision Transformer (ViT) Base Patch16 384
| Property | Value |
|---|---|
| Parameter Count | 86.9M |
| GMACs | 49.4 |
| Input Size | 384 x 384 |
| Training Data | ImageNet-21k + ImageNet-1k |
| Paper | How to train your ViT? |
What is vit_base_patch16_384.augreg_in21k_ft_in1k?
This is a Vision Transformer (ViT) image classification model. It was pretrained on ImageNet-21k and fine-tuned on ImageNet-1k with additional augmentation and regularization (the "AugReg" recipe from the paper listed above). The model divides each image into 16x16 patches and processes the resulting token sequence with a standard transformer encoder, reaching strong accuracy on ImageNet-1k classification.
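As a quick orientation, the sketch below loads the pretrained checkpoint through the `timm` library under the name shown in the heading; it assumes a recent `timm` release that provides the `resolve_model_data_config` helper.

```python
import timm

# Load the pretrained checkpoint by its timm name (weights are downloaded on first use).
model = timm.create_model(
    'vit_base_patch16_384.augreg_in21k_ft_in1k',
    pretrained=True,
)
model.eval()

# Resolve the model's preprocessing config (384x384 input, normalization, etc.)
# and build the matching evaluation transform.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)
```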
Implementation Details
The model uses the ViT-Base architecture with 86.9M parameters and operates on 384x384 pixel images. Each image is split into 16x16 pixel patches, giving 24x24 = 576 patch tokens that, together with a class token, are processed by the transformer encoder. The model was implemented in JAX by the original authors and later ported to PyTorch by Ross Wightman.
- Activations: 48.3M
- Computational complexity: 49.4 GMACs
- Supports both classification and feature extraction (see the inference sketch after this list)
- Implements advanced augmentation and regularization techniques
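To make the classification path concrete, here is a minimal inference sketch. It reuses the `model` and `transform` objects created above; the image filename and the top-5 readout are illustrative placeholders.

```python
import torch
from PIL import Image

# Hypothetical input image; any RGB image works and is resized to 384x384 by the transform.
img = Image.open('example.jpg').convert('RGB')

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000) ImageNet-1k classes
    probs = logits.softmax(dim=-1)

top5_prob, top5_idx = probs.topk(5)
print(top5_idx[0].tolist(), top5_prob[0].tolist())
```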
Core Capabilities
- Image Classification with high accuracy
- Feature extraction for downstream tasks
- Handles 384x384 resolution images
- Provides both pooled and unpooled feature outputs (illustrated in the sketch after this list)
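The pooled/unpooled distinction maps onto two common timm usage patterns, sketched below: creating the model with `num_classes=0` returns one pooled embedding per image, while `forward_features` returns the full token sequence. The shapes follow from the ViT-Base configuration, i.e. 768-dim embeddings and 24x24 = 576 patch tokens plus a class token.

```python
import timm
import torch

# Dummy batch at the model's native 384x384 resolution.
x = torch.randn(1, 3, 384, 384)

# Pooled features: drop the classifier head to get one 768-dim vector per image.
backbone = timm.create_model(
    'vit_base_patch16_384.augreg_in21k_ft_in1k',
    pretrained=True,
    num_classes=0,
)
backbone.eval()

with torch.no_grad():
    pooled = backbone(x)                   # shape: (1, 768)
    tokens = backbone.forward_features(x)  # shape: (1, 577, 768) -> class token + 576 patch tokens
```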
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its enhanced training recipe, with additional augmentation and regularization, combined with its two-stage training process (pretraining on ImageNet-21k and fine-tuning on ImageNet-1k). The 384x384 input resolution captures more image detail than the 224x224 variants of the same architecture.
Q: What are the recommended use cases?
The model is particularly well-suited for high-resolution image classification tasks, feature extraction for transfer learning, and as a backbone for various computer vision applications. It's especially effective when working with detailed images that benefit from the larger 384x384 input resolution.
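For transfer learning, a common pattern is to reinstantiate the model with a fresh classifier head sized for the downstream task. The sketch below follows that pattern under stated assumptions: the 10-class head, batch size, and optimizer settings are placeholder choices, not values from the source.

```python
import timm
import torch

# Hypothetical downstream task with 10 classes: pretrained backbone, newly initialized head.
model = timm.create_model(
    'vit_base_patch16_384.augreg_in21k_ft_in1k',
    pretrained=True,
    num_classes=10,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch at the model's native resolution.
images = torch.randn(4, 3, 384, 384)
labels = torch.randint(0, 10, (4,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

In practice the backbone is often trained with a lower learning rate than the new head, or frozen entirely at first, depending on how much labeled data the downstream task provides.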