BEiT-v2 Base Vision Transformer
| Property | Value |
|---|---|
| Parameter Count | 102.6M |
| Architecture Type | Vision Transformer |
| Image Size | 224x224 pixels |
| License | Apache-2.0 |
| Paper | BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers |
What is beitv2_base_patch16_224.in1k_ft_in22k?
BEiT-v2 is a vision transformer pre-trained with self-supervised masked image modeling (MIM), using a VQ-KD encoder as its visual tokenizer. This checkpoint was pre-trained on ImageNet-1k and subsequently fine-tuned on ImageNet-22k, combining self-supervised representation learning with broad supervised label coverage.
Implementation Details
The model divides each 224x224 input image into 16x16 patches and processes them with a transformer backbone, requiring roughly 17.6 GMACs and 23.9M activations per image. Its visual tokenizer was trained against a CLIP B/16 teacher model (vector-quantized knowledge distillation), and the released checkpoint supports both image classification and feature extraction (see the sketch after the list below).
- Base architecture with 16x16 patch size
- Pre-trained using masked image modeling
- Fine-tuned on ImageNet-22k dataset
- Supports both classification and feature extraction
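As a concrete illustration of the classification workflow, here is a minimal sketch using timm's standard loading and preprocessing API; the image path `example.jpg` is a placeholder, and the predicted indices refer to the ImageNet-22k label space this checkpoint was fine-tuned on.

```python
import timm
import torch
from PIL import Image

# Placeholder image path; substitute any RGB image.
img = Image.open("example.jpg").convert("RGB")

# Load the pretrained checkpoint via timm (weights download on first use).
model = timm.create_model("beitv2_base_patch16_224.in1k_ft_in22k", pretrained=True)
model = model.eval()

# Build preprocessing that matches the model's training config
# (224x224 input and the expected normalization).
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, num_classes)

# Top-5 class indices in the ImageNet-22k label space.
top5_prob, top5_idx = torch.topk(logits.softmax(dim=-1), k=5)
print(top5_idx.tolist(), top5_prob.tolist())
```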
Core Capabilities
- Image classification over the ImageNet-22k label space
- Feature extraction and image embedding generation (see the sketch after this list)
- 224x224 pixel input resolution
- Moderate compute footprint for a ViT-Base backbone (17.6 GMACs per image)
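For embedding generation, timm's usual pattern is to create the model with `num_classes=0`, which removes the classification head so the forward pass returns pooled features. The sketch below assumes timm is installed and uses a placeholder image file.

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes the classification head, so the forward pass
# returns pooled image embeddings instead of logits.
model = timm.create_model(
    "beitv2_base_patch16_224.in1k_ft_in22k",
    pretrained=True,
    num_classes=0,
)
model = model.eval()

data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open("example.jpg").convert("RGB")  # placeholder image path
x = transform(img).unsqueeze(0)

with torch.no_grad():
    embedding = model(x)                # pooled embedding, shape (1, 768)
    tokens = model.forward_features(x)  # unpooled tokens, shape (1, 197, 768)

print(embedding.shape, tokens.shape)
```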
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its vector-quantized visual tokenizer (VQ-KD) and its training recipe, which combines self-supervised MIM pre-training on ImageNet-1k with supervised fine-tuning on ImageNet-22k, yielding representations that transfer well across a range of computer vision tasks.
Q: What are the recommended use cases?
The model is ideal for image classification tasks and generating image embeddings. It's particularly well-suited for applications requiring high-quality feature extraction or transfer learning on downstream tasks.
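For transfer learning, a common approach (not specific to this model card) is to re-initialize the classification head for the downstream label set and optionally freeze the backbone for linear probing. The 10-class task and learning rate below are purely hypothetical.

```python
import timm
import torch

# Hypothetical downstream task with 10 classes: timm keeps the pretrained
# backbone and attaches a freshly initialized 10-way classification head.
model = timm.create_model(
    "beitv2_base_patch16_224.in1k_ft_in22k",
    pretrained=True,
    num_classes=10,
)

# Optional linear probing: freeze everything except the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```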