vit_base_patch16_384.augreg_in21k_ft_in1k

timm

Vision Transformer (ViT) model trained on ImageNet-21k & fine-tuned on ImageNet-1k. 86.9M params, 384x384 input, optimized for classification.

  • Parameter Count: 86.9M
  • GMACs: 49.4
  • Input Size: 384 x 384
  • Training Data: ImageNet-21k + ImageNet-1k
  • Paper: How to train your ViT?

What is vit_base_patch16_384.augreg_in21k_ft_in1k?

This is a Vision Transformer (ViT) model pretrained on ImageNet-21k and fine-tuned on ImageNet-1k, using the additional augmentation and regularization ("AugReg") recipe from the paper to improve performance. The model divides each image into 16x16 patches and processes the resulting token sequence with a transformer encoder, achieving strong accuracy on image classification tasks.

Implementation Details

The model uses the ViT-Base architecture with 86.9M parameters and operates on 384x384 pixel images. Each image is divided into 16x16 pixel patches, which are embedded and fed to the transformer encoder as a token sequence. The model was originally implemented in JAX by the paper's authors and later ported to PyTorch by Ross Wightman.
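The patch arithmetic above can be sketched in a few lines. This is a plain-Python illustration of the standard ViT-Base configuration (16x16 patches, 768-dim embeddings, a prepended class token); the variable names are ours, not part of any library API:

```python
# Patch-tokenization arithmetic for a ViT-Base model at 384x384 input.
image_size = 384
patch_size = 16
embed_dim = 768  # ViT-Base hidden size

patches_per_side = image_size // patch_size   # 384 / 16 = 24
num_patches = patches_per_side ** 2           # 24 * 24 = 576 patches
num_tokens = num_patches + 1                  # +1 for the class token -> 577

# Each 16x16 RGB patch is flattened (16*16*3 = 768 values)
# and linearly projected to the embedding dimension.
patch_values = patch_size * patch_size * 3

print(num_patches, num_tokens, patch_values)  # 576 577 768
```

This is why the 384x384 variant is more expensive than the 224x224 one: attention cost grows with the square of the token count, and 577 tokens is roughly 2.9x the 197 tokens of a 224x224 input.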

  • Activation size: 48.3M activations
  • Computational complexity: 49.4 GMACs
  • Supports both classification and feature extraction
  • Implements advanced augmentation and regularization techniques

Core Capabilities

  • Image Classification with high accuracy
  • Feature extraction for downstream tasks
  • Handles 384x384 resolution images
  • Provides both pooled and unpooled feature outputs

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its enhanced training regime with additional augmentation and regularization, combined with its two-stage training process (pretraining on ImageNet-21k and fine-tuning on ImageNet-1k). The 384x384 input resolution lets it capture finer image detail than the smaller-resolution variants.

Q: What are the recommended use cases?

The model is particularly well-suited for high-resolution image classification tasks, feature extraction for transfer learning, and as a backbone for various computer vision applications. It's especially effective when working with detailed images that benefit from the larger 384x384 input resolution.
