BEiT-v2 Base Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 102.6M |
| Architecture Type | Vision Transformer |
| Image Size | 224x224 pixels |
| License | Apache-2.0 |
| Paper | BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers |
What is beitv2_base_patch16_224.in1k_ft_in22k?
BEiT-v2 is a vision transformer pre-trained with self-supervised masked image modeling (MIM), using a VQ-KD encoder as its visual tokenizer. This checkpoint was pre-trained on ImageNet-1k and subsequently fine-tuned on ImageNet-22k, combining self-supervised representation learning with broad supervised label coverage.
Implementation Details
The model divides each 224x224 input image into 16x16 patches and processes them with a transformer backbone, requiring roughly 17.6 GMACs and 23.9M activations per image. Its visual tokenizer was trained against a CLIP B/16 teacher model (vector-quantized knowledge distillation), and the released checkpoint supports both image classification and feature extraction (see the sketch after the list below).
- Base architecture with 16x16 patch size
- Pre-trained using masked image modeling
- Fine-tuned on ImageNet-22k dataset
- Supports both classification and feature extraction
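As a concrete illustration of the classification workflow, here is a minimal sketch using timm's standard loading and preprocessing API; the image path `example.jpg` is a placeholder, and the predicted indices refer to the ImageNet-22k label space this checkpoint was fine-tuned on.

```python
import timm
import torch
from PIL import Image

# Placeholder image path; substitute any RGB image.
img = Image.open("example.jpg").convert("RGB")

# Load the pretrained checkpoint via timm (weights download on first use).
model = timm.create_model("beitv2_base_patch16_224.in1k_ft_in22k", pretrained=True)
model = model.eval()

# Build preprocessing that matches the model's training config
# (224x224 input and the expected normalization).
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, num_classes)

# Top-5 class indices in the ImageNet-22k label space.
top5_prob, top5_idx = torch.topk(logits.softmax(dim=-1), k=5)
print(top5_idx.tolist(), top5_prob.tolist())
```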
Core Capabilities
- Image classification over the ImageNet-22k label space
- Feature extraction and image embedding generation (see the sketch after this list)
- 224x224 pixel input resolution
- Moderate compute footprint for a ViT-Base backbone (17.6 GMACs per image)
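For embedding generation, timm's usual pattern is to create the model with `num_classes=0`, which removes the classification head so the forward pass returns pooled features. The sketch below assumes timm is installed and uses a placeholder image file.

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes the classification head, so the forward pass
# returns pooled image embeddings instead of logits.
model = timm.create_model(
    "beitv2_base_patch16_224.in1k_ft_in22k",
    pretrained=True,
    num_classes=0,
)
model = model.eval()

data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open("example.jpg").convert("RGB")  # placeholder image path
x = transform(img).unsqueeze(0)

with torch.no_grad():
    embedding = model(x)                # pooled embedding, shape (1, 768)
    tokens = model.forward_features(x)  # unpooled tokens, shape (1, 197, 768)

print(embedding.shape, tokens.shape)
```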
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its vector-quantized visual tokenizer (VQ-KD) and its training recipe, which combines self-supervised MIM pre-training on ImageNet-1k with supervised fine-tuning on ImageNet-22k, yielding representations that transfer well across a range of computer vision tasks.
Q: What are the recommended use cases?
The model is ideal for image classification tasks and generating image embeddings. It's particularly well-suited for applications requiring high-quality feature extraction or transfer learning on downstream tasks.
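For transfer learning, a common approach (not specific to this model card) is to re-initialize the classification head for the downstream label set and optionally freeze the backbone for linear probing. The 10-class task and learning rate below are purely hypothetical.

```python
import timm
import torch

# Hypothetical downstream task with 10 classes: timm keeps the pretrained
# backbone and attaches a freshly initialized 10-way classification head.
model = timm.create_model(
    "beitv2_base_patch16_224.in1k_ft_in22k",
    pretrained=True,
    num_classes=10,
)

# Optional linear probing: freeze everything except the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```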