beitv2_base_patch16_224.in1k_ft_in22k

Maintained By
timm

BEiT-v2 Base Vision Transformer

Parameter Count: 102.6M
Architecture Type: Vision Transformer
Image Size: 224x224 pixels
License: Apache-2.0
Paper: BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

What is beitv2_base_patch16_224.in1k_ft_in22k?

BEiT-v2 is a ViT-Base vision transformer pre-trained with self-supervised masked image modeling (MIM), where the prediction targets come from a visual tokenizer trained via vector-quantized knowledge distillation (VQ-KD). This checkpoint was pre-trained on ImageNet-1k and then fine-tuned on ImageNet-22k, giving it a 21,841-class classification head and strong general-purpose image features.

Implementation Details

The model splits each 224x224 input image into 16x16 patches and processes them with a standard transformer encoder, at a cost of roughly 17.6 GMACs and 23.9M activations per image. During pre-training, the VQ-KD visual tokenizer is trained against a CLIP B/16 teacher, and its discrete tokens serve as the masked-image-modeling targets; the fine-tuned checkpoint can then be used for classification or feature extraction (see the usage sketch after the list below).

  • Base architecture with 16x16 patch size
  • Pre-trained using masked image modeling
  • Fine-tuned on ImageNet-22k dataset
  • Supports both classification and feature extraction
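
The snippet below is a minimal inference sketch following timm's standard workflow; the file name `image.jpg` and the top-5 readout are illustrative placeholders rather than part of the original model card.

```python
# Minimal inference sketch (assumes a recent timm, torch, and Pillow install,
# plus a local example file `image.jpg` -- both are placeholders).
from PIL import Image
import timm
import torch

model = timm.create_model('beitv2_base_patch16_224.in1k_ft_in22k', pretrained=True)
model = model.eval()

# Recreate the preprocessing used at training time from the pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('image.jpg').convert('RGB')
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # (1, 21841) ImageNet-22k logits

top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```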

Core Capabilities

  • Image classification over the ImageNet-22k label space
  • Feature extraction and embedding generation (see the sketch after this list)
  • Support for 224x224 pixel input images
  • Moderate compute footprint of roughly 17.6 GMACs and 23.9M activations per image
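
A short sketch of embedding generation, assuming timm's standard feature-extraction interface; the random tensor stands in for a preprocessed image batch and is not meaningful data.

```python
# Embedding sketch: use the backbone without its classifier head.
# The random tensor below is a stand-in for a preprocessed image batch.
import timm
import torch

# num_classes=0 drops the classification head; the forward pass then
# returns pooled pre-logits features.
backbone = timm.create_model(
    'beitv2_base_patch16_224.in1k_ft_in22k', pretrained=True, num_classes=0
)
backbone = backbone.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    pooled = backbone(x)                   # (1, 768) image embedding
    tokens = backbone.forward_features(x)  # (1, 197, 768) patch tokens + CLS

print(pooled.shape, tokens.shape)
```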

Frequently Asked Questions

Q: What makes this model unique?

The model's distinguishing feature is its VQ-KD visual tokenizer, which distills semantics from a CLIP B/16 teacher during masked image modeling, combined with supervised fine-tuning on the 21,841-class ImageNet-22k label space. This pairing yields representations that transfer well across a broad range of computer vision tasks.

Q: What are the recommended use cases?

The model is ideal for image classification tasks and generating image embeddings. It's particularly well-suited for applications requiring high-quality feature extraction or transfer learning on downstream tasks.
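
As a rough illustration of transfer learning, the sketch below replaces the ImageNet-22k head with a new classifier; the 10-class task, hyperparameters, and dummy batch are assumptions for demonstration only.

```python
# Transfer-learning sketch: attach a fresh head for a downstream task.
# The 10-class setup, hyperparameters, and random data are illustrative
# assumptions, not recommendations from the model card.
import timm
import torch

model = timm.create_model(
    'beitv2_base_patch16_224.in1k_ft_in22k',
    pretrained=True,
    num_classes=10,  # replaces the ImageNet-22k head with a new 10-way classifier
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on random data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```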
