BEiT Base Patch16 224
| Property | Value |
|---|---|
| Parameter Count | 86.5M |
| License | Apache 2.0 |
| Architecture | Vision Transformer (ViT) |
| Paper | BEiT: BERT Pre-Training of Image Transformers |
| Image Size | 224 x 224 |
What is beit_base_patch16_224.in22k_ft_in22k_in1k?
beit_base_patch16_224.in22k_ft_in22k_in1k is a base-sized vision transformer implementing the BEiT (BERT Pre-training of Image Transformers) architecture. It was first pre-trained on ImageNet-22k with self-supervised masked image modeling (MIM), using a DALL-E dVAE as the visual tokenizer, and then fine-tuned sequentially on ImageNet-22k and ImageNet-1k.
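As a quick orientation, the sketch below shows the typical timm loading and inference pattern for this checkpoint. It assumes `timm`, `torch`, and `Pillow` are installed, and `cat.jpg` is only a placeholder image path.

```python
# Minimal inference sketch for the BEiT checkpoint via timm (assumptions noted above).
import timm
import torch
from PIL import Image

# Load the pretrained checkpoint from the timm hub.
model = timm.create_model("beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=True)
model.eval()

# Build the preprocessing pipeline that matches the model's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("cat.jpg").convert("RGB")  # placeholder input image
with torch.no_grad():
    logits = model(transform(image).unsqueeze(0))  # shape: (1, 1000) ImageNet-1k logits

probs = logits.softmax(dim=-1)
top5 = probs.topk(5)
print(top5.values, top5.indices)
```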
Implementation Details
The model splits each 224 x 224 input into 16x16 patches and processes the resulting token sequence with a transformer containing 86.5M parameters. A forward pass requires 17.6 GMACs and produces 23.9M activations. The architecture follows the established vision transformer design while incorporating BERT-style pre-training mechanisms; a shape-check sketch follows the list below.
- Pre-trained using masked image modeling on ImageNet-22k
- Fine-tuned on ImageNet-22k and ImageNet-1k
- Uses 16x16 pixel patches for image processing
- Implements DALL-E dVAE as visual tokenizer
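To make the patch arithmetic concrete, here is a minimal shape-check sketch assuming `timm` and `torch` are available. The expected token-sequence shape is an assumption based on the standard base-model configuration (196 patches plus a class token, 768-dim embeddings).

```python
# Sketch of the 16x16 patch tokenization arithmetic for a 224x224 input.
import timm
import torch

# pretrained=False avoids downloading weights; the architecture is identical.
model = timm.create_model("beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=False)
model.eval()

# A 224x224 image yields (224 / 16)^2 = 196 non-overlapping 16x16 patches.
num_patches = (224 // 16) ** 2
print(num_patches)  # 196

# forward_features returns the token sequence; with the class token prepended,
# the expected shape is roughly (1, 197, 768) for the base model.
with torch.no_grad():
    tokens = model.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)
```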
Core Capabilities
- Image Classification
- Feature Extraction
- Transfer Learning
- Visual Representation Learning
Frequently Asked Questions
Q: What makes this model unique?
BEiT brings BERT-style pre-training to vision transformers: instead of predicting masked words, it predicts the dVAE visual tokens of masked image patches, which lets the model learn from unlabeled images. The subsequent fine-tuning on ImageNet-22k and then ImageNet-1k builds classification ability on top of those self-supervised representations.
Q: What are the recommended use cases?
The model is best suited to image classification and to serving as a backbone for feature extraction in downstream computer vision applications. It is also a strong starting point for transfer learning, where the pretrained weights are fine-tuned on a smaller, task-specific dataset.
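As a rough illustration of both use cases, the sketch below loads the checkpoint once as a headless feature extractor and once with a fresh classification head for fine-tuning. `NUM_CLASSES` is a placeholder for your own dataset, and the example assumes `timm` and `torch` are installed.

```python
# Hedged sketch: feature extraction and transfer learning with this checkpoint.
import timm
import torch

# Feature extraction: num_classes=0 removes the classifier head so the forward
# pass returns pooled embeddings (768-dim for the base model).
backbone = timm.create_model(
    "beit_base_patch16_224.in22k_ft_in22k_in1k", pretrained=True, num_classes=0
)
backbone.eval()
with torch.no_grad():
    embeddings = backbone(torch.randn(2, 3, 224, 224))
print(embeddings.shape)  # expected: (2, 768)

# Transfer learning: request a freshly initialized classification head sized
# for the new task, then train with a standard PyTorch loop.
NUM_CLASSES = 10  # placeholder
finetune_model = timm.create_model(
    "beit_base_patch16_224.in22k_ft_in22k_in1k",
    pretrained=True,
    num_classes=NUM_CLASSES,
)
```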