BEiT Base Patch16 224 (ImageNet-22k)
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | BEIT: BERT Pre-Training of Image Transformers |
| Training Data | ImageNet-22k (14M images, 21,841 classes) |
| Input Resolution | 224x224 pixels |
What is beit-base-patch16-224-pt22k-ft22k?
BEiT is a Vision Transformer pre-trained in a self-supervised, BERT-like fashion: image patches are randomly masked and the model learns to predict discrete visual tokens for the masked patches. This specific variant is pre-trained and then fine-tuned on ImageNet-22k, processing images as 16x16 pixel patches at 224x224 resolution. The visual tokens used as pre-training targets come from the discrete VAE tokenizer of OpenAI's DALL-E.
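As a rough illustration of how the fine-tuned checkpoint can be used for classification, here is a minimal sketch with the Hugging Face `transformers` library. The example image URL is just a placeholder, and older `transformers` releases expose `BeitFeatureExtractor` instead of `BeitImageProcessor`.

```python
import requests
import torch
from PIL import Image
from transformers import BeitForImageClassification, BeitImageProcessor

# Placeholder example image (any RGB image works)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")

# Resizes to 224x224 and normalizes; the model splits the image into 16x16 patches internally
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 21841): one logit per ImageNet-22k class

predicted_class = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```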
Implementation Details
The model is a standard Transformer encoder with two notable departures from the original ViT design: it uses relative position embeddings instead of absolute position embeddings, and it performs classification by mean-pooling the final patch embeddings rather than relying on a [CLS] token.
- Pre-trained on ImageNet-22k with 14 million images
- Uses 16x16 pixel patches for image processing
- Implements relative position embeddings similar to T5
- Normalizes images with per-channel mean and std of (0.5, 0.5, 0.5); see the sketch after this list
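These settings can be checked directly against the checkpoint. The snippet below is a sketch using the `transformers` `BeitConfig` and `BeitImageProcessor` APIs; the values given in the comments are the ones expected from the list above.

```python
from transformers import BeitConfig, BeitImageProcessor

checkpoint = "microsoft/beit-base-patch16-224-pt22k-ft22k"
config = BeitConfig.from_pretrained(checkpoint)
processor = BeitImageProcessor.from_pretrained(checkpoint)

print(config.image_size, config.patch_size)        # expected: 224, 16
print(config.use_mean_pooling)                     # mean-pooling of patch embeddings for classification
print(config.use_relative_position_bias)           # relative position embeddings (T5-style bias)
print(processor.image_mean, processor.image_std)   # expected: [0.5, 0.5, 0.5] for both
```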
Core Capabilities
- Image classification across 21,841 classes
- Feature extraction for downstream tasks (sketched after this list)
- Self-supervised learning from masked patches
- Efficient processing of 224x224 resolution images
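For the feature-extraction use case, a minimal sketch with `BeitModel` (the bare encoder, without the classification head) could look like the following; `example.jpg` is a placeholder path for a local RGB image.

```python
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitModel

checkpoint = "microsoft/beit-base-patch16-224-pt22k-ft22k"
processor = BeitImageProcessor.from_pretrained(checkpoint)
model = BeitModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# A 224x224 input yields 14x14 = 196 patch tokens plus one [CLS] token, hidden size 768
features = outputs.last_hidden_state   # shape (1, 197, 768)
pooled = outputs.pooler_output         # mean-pooled patch representation
```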
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing feature is BERT-style self-supervised pre-training for vision: masked image patches are predicted as discrete visual tokens, and the encoder uses relative position embeddings. This pre-training makes the model a strong starting point for transfer learning on image classification tasks.
Q: What are the recommended use cases?
The model is ideal for image classification tasks, feature extraction, and transfer learning applications. It's particularly suitable for scenarios requiring classification among a large number of categories, thanks to its training on ImageNet-22k.
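For transfer learning, one common pattern (a sketch, not something prescribed by this model card) is to reload the checkpoint with a freshly initialized classification head sized for the downstream label set and fine-tune from there; `num_labels=10` below is an arbitrary example.

```python
from transformers import BeitForImageClassification

# Hypothetical downstream task with 10 classes; set num_labels to match your dataset
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k-ft22k",
    num_labels=10,
    ignore_mismatched_sizes=True,  # drops the 21,841-way head and attaches a fresh 10-way head
)
# The new head is randomly initialized; fine-tune on the downstream data,
# e.g. with the Trainer API or a plain PyTorch training loop.
```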