BEiT Base Patch16 224
| Property | Value |
|---|---|
| Parameter Count | 86.5M |
| License | Apache 2.0 |
| Architecture | Vision Transformer (ViT) |
| Paper | BEiT: BERT Pre-Training of Image Transformers |
| Image Size | 224 x 224 |
What is beit_base_patch16_224.in22k_ft_in22k_in1k?
beit_base_patch16_224.in22k_ft_in22k_in1k is a base-sized vision transformer implementing the BEiT (BERT Pre-training of Image Transformers) architecture. It was first pre-trained on ImageNet-22k with self-supervised masked image modeling (MIM), using a DALL-E dVAE as the visual tokenizer, and then fine-tuned sequentially on ImageNet-22k and ImageNet-1k.
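As a quick orientation, the sketch below shows the typical timm loading and inference pattern for this checkpoint. It assumes `timm`, `torch`, and `Pillow` are installed, and `cat.jpg` is only a placeholder image path.

```python
# Minimal inference sketch for the BEiT checkpoint via timm (assumptions noted above).
import timm
import torch
from PIL import Image

# Load the pretrained checkpoint from the timm hub.
model = timm.create_model("beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=True)
model.eval()

# Build the preprocessing pipeline that matches the model's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("cat.jpg").convert("RGB")  # placeholder input image
with torch.no_grad():
    logits = model(transform(image).unsqueeze(0))  # shape: (1, 1000) ImageNet-1k logits

probs = logits.softmax(dim=-1)
top5 = probs.topk(5)
print(top5.values, top5.indices)
```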
Implementation Details
The model splits each 224 x 224 input into 16x16 patches and processes the resulting token sequence with a transformer containing 86.5M parameters. A forward pass requires 17.6 GMACs and produces 23.9M activations. The architecture follows the established vision transformer design while incorporating BERT-style pre-training mechanisms; a shape-check sketch follows the list below.
- Pre-trained using masked image modeling on ImageNet-22k
- Fine-tuned on ImageNet-22k and ImageNet-1k
- Uses 16x16 pixel patches for image processing
- Implements DALL-E dVAE as visual tokenizer
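To make the patch arithmetic concrete, here is a minimal shape-check sketch assuming `timm` and `torch` are available. The expected token-sequence shape is an assumption based on the standard base-model configuration (196 patches plus a class token, 768-dim embeddings).

```python
# Sketch of the 16x16 patch tokenization arithmetic for a 224x224 input.
import timm
import torch

# pretrained=False avoids downloading weights; the architecture is identical.
model = timm.create_model("beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=False)
model.eval()

# A 224x224 image yields (224 / 16)^2 = 196 non-overlapping 16x16 patches.
num_patches = (224 // 16) ** 2
print(num_patches)  # 196

# forward_features returns the token sequence; with the class token prepended,
# the expected shape is roughly (1, 197, 768) for the base model.
with torch.no_grad():
    tokens = model.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)
```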
Core Capabilities
- Image Classification
- Feature Extraction
- Transfer Learning
- Visual Representation Learning
Frequently Asked Questions
Q: What makes this model unique?
BEiT brings BERT-style pre-training to vision transformers: instead of predicting masked words, it predicts the dVAE visual tokens of masked image patches, which lets the model learn from unlabeled images. The subsequent fine-tuning on ImageNet-22k and then ImageNet-1k builds classification ability on top of those self-supervised representations.
Q: What are the recommended use cases?
The model is best suited to image classification and to serving as a backbone for feature extraction in downstream computer vision applications. It is also a strong starting point for transfer learning, where the pretrained weights are fine-tuned on a smaller, task-specific dataset.
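As a rough illustration of both use cases, the sketch below loads the checkpoint once as a headless feature extractor and once with a fresh classification head for fine-tuning. `NUM_CLASSES` is a placeholder for your own dataset, and the example assumes `timm` and `torch` are installed.

```python
# Hedged sketch: feature extraction and transfer learning with this checkpoint.
import timm
import torch

# Feature extraction: num_classes=0 removes the classifier head so the forward
# pass returns pooled embeddings (768-dim for the base model).
backbone = timm.create_model(
    "beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=True, num_classes=0
)
backbone.eval()
with torch.no_grad():
    embeddings = backbone(torch.randn(2, 3, 224, 224))
print(embeddings.shape)  # expected: (2, 768)

# Transfer learning: request a freshly initialized classification head sized
# for the new task, then train with a standard PyTorch loop.
NUM_CLASSES = 10  # placeholder
finetune_model = timm.create_model(
    "beit_base_patch16_224.in22k_ft_in22k_in1k",
    pretrained=True,
    num_classes=NUM_CLASSES,
)
```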