BEiT Base Patch16 224 (ImageNet-22k)
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | BEIT: BERT Pre-Training of Image Transformers |
| Training Data | ImageNet-22k (14M images, 21,841 classes) |
| Input Resolution | 224x224 pixels |
What is beit-base-patch16-224-pt22k-ft22k?
BEiT is a Vision Transformer pre-trained in a self-supervised, BERT-like fashion: image patches are randomly masked and the model learns to predict discrete visual tokens for the masked patches. This specific variant is pre-trained and then fine-tuned on ImageNet-22k, processing images as 16x16 pixel patches at 224x224 resolution. The visual tokens used as pre-training targets come from the discrete VAE tokenizer of OpenAI's DALL-E.
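As a rough illustration of how the fine-tuned checkpoint can be used for classification, here is a minimal sketch with the Hugging Face `transformers` library. The example image URL is just a placeholder, and older `transformers` releases expose `BeitFeatureExtractor` instead of `BeitImageProcessor`.

```python
import requests
import torch
from PIL import Image
from transformers import BeitForImageClassification, BeitImageProcessor

# Placeholder example image (any RGB image works)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")

# Resizes to 224x224 and normalizes; the model splits the image into 16x16 patches internally
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 21841): one logit per ImageNet-22k class

predicted_class = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```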
Implementation Details
The model is a standard Transformer encoder with two notable departures from the original ViT design: it uses relative position embeddings instead of absolute position embeddings, and it performs classification by mean-pooling the final patch embeddings rather than relying on a [CLS] token.
- Pre-trained on ImageNet-22k with 14 million images
- Uses 16x16 pixel patches for image processing
- Implements relative position embeddings similar to T5
- Normalizes images with per-channel mean and std of (0.5, 0.5, 0.5); see the sketch after this list
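These settings can be checked directly against the checkpoint. The snippet below is a sketch using the `transformers` `BeitConfig` and `BeitImageProcessor` APIs; the values given in the comments are the ones expected from the list above.

```python
from transformers import BeitConfig, BeitImageProcessor

checkpoint = "microsoft/beit-base-patch16-224-pt22k-ft22k"
config = BeitConfig.from_pretrained(checkpoint)
processor = BeitImageProcessor.from_pretrained(checkpoint)

print(config.image_size, config.patch_size)        # expected: 224, 16
print(config.use_mean_pooling)                     # mean-pooling of patch embeddings for classification
print(config.use_relative_position_bias)           # relative position embeddings (T5-style bias)
print(processor.image_mean, processor.image_std)   # expected: [0.5, 0.5, 0.5] for both
```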
Core Capabilities
- Image classification across 21,841 classes
- Feature extraction for downstream tasks (sketched after this list)
- Self-supervised learning from masked patches
- Efficient processing of 224x224 resolution images
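For the feature-extraction use case, a minimal sketch with `BeitModel` (the bare encoder, without the classification head) could look like the following; `example.jpg` is a placeholder path for a local RGB image.

```python
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitModel

checkpoint = "microsoft/beit-base-patch16-224-pt22k-ft22k"
processor = BeitImageProcessor.from_pretrained(checkpoint)
model = BeitModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# A 224x224 input yields 14x14 = 196 patch tokens plus one [CLS] token, hidden size 768
features = outputs.last_hidden_state   # shape (1, 197, 768)
pooled = outputs.pooler_output         # mean-pooled patch representation
```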
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing feature is BERT-style self-supervised pre-training for vision: masked image patches are predicted as discrete visual tokens, and the encoder uses relative position embeddings. This pre-training makes the model a strong starting point for transfer learning on image classification tasks.
Q: What are the recommended use cases?
The model is ideal for image classification tasks, feature extraction, and transfer learning applications. It's particularly suitable for scenarios requiring classification among a large number of categories, thanks to its training on ImageNet-22k.
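For transfer learning, one common pattern (a sketch, not something prescribed by this model card) is to reload the checkpoint with a freshly initialized classification head sized for the downstream label set and fine-tune from there; `num_labels=10` below is an arbitrary example.

```python
from transformers import BeitForImageClassification

# Hypothetical downstream task with 10 classes; set num_labels to match your dataset
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k-ft22k",
    num_labels=10,
    ignore_mismatched_sizes=True,  # drops the 21,841-way head and attaches a fresh 10-way head
)
# The new head is randomly initialized; fine-tune on the downstream data,
# e.g. with the Trainer API or a plain PyTorch training loop.
```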