# EVA-CLIP
| Property | Value |
|---|---|
| License | MIT |
| Paper | arXiv:2303.15389 |
| Author | QuanSun |
## What is EVA-CLIP?
EVA-CLIP is a series of state-of-the-art vision-language models that achieve exceptional performance on zero-shot classification tasks. The model family spans several sizes, from the efficient EVA02_CLIP_B_psz16_s8B (149M parameters) to the powerful EVA02_CLIP_E_psz14_plus_s9B (5.0B parameters).
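For readers who want to try one of these checkpoints, the sketch below shows one way to load an EVA-CLIP model through the `open_clip` library. The model name and pretrained tag are assumptions for illustration; run `open_clip.list_pretrained()` to see which EVA-CLIP identifiers your installed version actually provides.

```python
# Minimal loading sketch, assuming open_clip exposes an EVA-CLIP checkpoint
# under the names below (verify with open_clip.list_pretrained()).
import torch
import open_clip

model_name = "EVA02-B-16"          # assumed identifier for EVA02_CLIP_B_psz16_s8B
pretrained = "merged2b_s8b_b131k"  # assumed pretrained tag; check before use

model, _, preprocess = open_clip.create_model_and_transforms(
    model_name, pretrained=pretrained
)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()
```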
## Implementation Details
The EVA-CLIP series is trained in different precision formats (fp16 and bf16) on large-scale datasets including LAION-400M, LAION-2B, and a custom Merged-2B dataset. Training used substantial compute, ranging from 64 to 256 A100 GPUs depending on the model variant.
- Multiple architecture variants available (Base, Large, and Enormous)
- Training batch sizes ranging from 41K to 144K
- Advanced model interpolation techniques for patch embedding and position embedding (see the sketch below)
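The interpolation bullet above refers to the common ViT technique of resampling a learned 2D position-embedding grid when the input resolution, and therefore the patch grid, changes. The sketch below is a generic, hedged illustration of that idea rather than the EVA-CLIP implementation; the function name and tensor layout are assumptions.

```python
# Hedged sketch of 2D position-embedding interpolation: resample the learned
# position grid so a ViT trained at one resolution can run at another.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, dim), with a leading [CLS] position."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, H, W) so bicubic image resampling can be applied
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. adapting a 14x14 grid (224px input, 16px patches) to a 16x16 grid (256px input)
new_pe = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid=16)
```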
## Core Capabilities
- State-of-the-art zero-shot classification performance on ImageNet (up to 82.0% top-1); see the worked example after this list
- Superior MSCOCO Text-to-Image retrieval (up to 75.0% R@5)
- Scalable architecture supporting various model sizes for different requirements
- Efficient training through MIM teacher-student framework
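The zero-shot classification capability listed above follows the standard CLIP recipe: encode the image and a set of text prompts, normalize both embeddings, and take a softmax over cosine similarities. The snippet below is a hedged, self-contained sketch; the model and pretrained names, prompts, and image path are placeholders.

```python
# Zero-shot classification sketch in the standard CLIP style.
# Model name, pretrained tag, prompts, and image path are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-B-16", pretrained="merged2b_s8b_b131k"  # assumed identifiers
)
tokenizer = open_clip.get_tokenizer("EVA02-B-16")
model.eval()

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(prompts)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)

print({p: round(float(s), 4) for p, s in zip(prompts, probs[0])})
```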
## Frequently Asked Questions
### Q: What makes this model unique?
The EVA-CLIP series provides the most performant open-source CLIP models at every scale, excelling in particular at zero-shot classification on mainstream benchmarks such as ImageNet and its variants.
### Q: What are the recommended use cases?
The models are particularly well suited to zero-shot image classification, text-to-image retrieval, and general vision-language tasks. The range of model sizes supports deployment in scenarios from resource-constrained environments to high-performance settings.
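As an illustration of the retrieval use case, the hedged sketch below ranks a handful of candidate images against a single text query by cosine similarity; the model identifiers and file paths are placeholders. In a real system the image embeddings would be precomputed and indexed rather than encoded per query.

```python
# Text-to-image retrieval sketch: rank candidate images for one text query.
# Model/pretrained names and file paths are illustrative placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-B-16", pretrained="merged2b_s8b_b131k"  # assumed identifiers
)
tokenizer = open_clip.get_tokenizer("EVA02-B-16")
model.eval()

query = tokenizer(["a person riding a bicycle"])
paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]  # placeholder candidates
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (txt_feats @ img_feats.T).squeeze(0)

# Print candidates from best to worst match
for rank, i in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(rank, paths[i], round(float(scores[i]), 4))
```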