EVA Giant Patch14-560
| Property | Value |
|---|---|
| Parameter Count | 1.01B parameters |
| Model Type | Image Classification / Feature Backbone |
| Architecture | Vision Transformer (ViT) |
| Input Size | 560 x 560 pixels |
| GMACs | 1906.8 |
| Paper | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
What is eva_giant_patch14_560.m30m_ft_in22k_in1k?
This is a state-of-the-art vision transformer at the top end of the EVA family. It was pretrained on the Merged-30M dataset (ImageNet-22K, CC12M, CC3M, Object365, COCO, and ADE20K) using masked image modeling with CLIP-L as the teacher, then fine-tuned on ImageNet-22k followed by ImageNet-1k.
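A minimal inference sketch, assuming the timm library (where checkpoints with this naming convention are distributed) and an illustrative local image file `example.jpg`:

```python
import torch
import timm
from PIL import Image

# Load the pretrained classifier (downloads weights on first use).
model = timm.create_model(
    'eva_giant_patch14_560.m30m_ft_in22k_in1k',
    pretrained=True,
)
model.eval()

# Build preprocessing that matches the model's config (560x560, model mean/std).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open('example.jpg').convert('RGB')   # illustrative input path
x = transform(img).unsqueeze(0)                  # shape: (1, 3, 560, 560)

with torch.no_grad():
    logits = model(x)                            # (1, 1000) ImageNet-1k logits
    top5_prob, top5_idx = logits.softmax(dim=-1).topk(5)
print(top5_idx, top5_prob)
```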
Implementation Details
The model is a giant-scale ViT with 1.01B parameters, processing 560x560 images with a 14x14 patch size. Inference is computationally demanding, requiring 1906.8 GMACs and 2577.2M activations per forward pass.
- Utilizes advanced masked visual representation learning
- Uses a plain (non-hierarchical) Vision Transformer architecture
- Achieves 89.792% top-1 accuracy on ImageNet-1k
- Supports both classification and feature extraction modes (see the sketch after this list)
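A sketch of the two feature-extraction paths, again assuming timm; the random tensor below is a placeholder standing in for a preprocessed 560x560 image batch:

```python
import torch
import timm

x = torch.randn(1, 3, 560, 560)   # placeholder for a preprocessed image batch

# Pooled image embeddings: create the model without its classification head.
feature_model = timm.create_model(
    'eva_giant_patch14_560.m30m_ft_in22k_in1k',
    pretrained=True,
    num_classes=0,
)
feature_model.eval()
with torch.no_grad():
    pooled = feature_model(x)                 # (1, num_features) embedding

# Unpooled token features from the full classification model.
clf_model = timm.create_model(
    'eva_giant_patch14_560.m30m_ft_in22k_in1k',
    pretrained=True,
)
clf_model.eval()
with torch.no_grad():
    tokens = clf_model.forward_features(x)    # (1, num_tokens, embed_dim)
```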
Core Capabilities
- High-resolution image classification
- Feature extraction for downstream tasks (a frozen-backbone example follows this list)
- Robust visual representation learning
- State-of-the-art performance on standard benchmarks
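As one example of downstream use, the sketch below freezes the backbone and trains only a small linear head on top; the 10-class task, batch tensors, and hyperparameters are illustrative placeholders, not part of the model card:

```python
import torch
import timm

# Frozen backbone used as a fixed feature extractor.
backbone = timm.create_model(
    'eva_giant_patch14_560.m30m_ft_in22k_in1k',
    pretrained=True,
    num_classes=0,
)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small trainable head for a hypothetical 10-class downstream task.
head = torch.nn.Linear(backbone.num_features, 10)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Placeholder batch; a real pipeline would feed preprocessed 560x560 images.
images = torch.randn(2, 3, 560, 560)
labels = torch.randint(0, 10, (2,))

with torch.no_grad():
    feats = backbone(images)       # (2, num_features) pooled features
loss = criterion(head(feats), labels)
loss.backward()
optimizer.step()
```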
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its massive scale (1B+ parameters) and comprehensive pretraining on a merged dataset of 30M images, combined with an innovative masked image modeling approach using CLIP-L as a teacher.
Q: What are the recommended use cases?
The model is best suited to image classification tasks where maximum accuracy is the priority and to feature extraction for downstream applications. However, due to its size, it requires significant computational resources.