# EVA-CLIP
| Property | Value |
|---|---|
| License | MIT |
| Paper | arXiv:2303.15389 |
| Author | QuanSun |
## What is EVA-CLIP?
EVA-CLIP is a series of state-of-the-art vision-language models that achieve exceptional performance on zero-shot classification tasks. The model family spans several sizes, from the efficient EVA02_CLIP_B_psz16_s8B (149M parameters) to the powerful EVA02_CLIP_E_psz14_plus_s9B (5.0B parameters).
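For readers who want to try one of these checkpoints, the sketch below shows one way to load an EVA-CLIP model through the `open_clip` library. The model name and pretrained tag are assumptions for illustration; run `open_clip.list_pretrained()` to see which EVA-CLIP identifiers your installed version actually provides.

```python
# Minimal loading sketch, assuming open_clip exposes an EVA-CLIP checkpoint
# under the names below (verify with open_clip.list_pretrained()).
import torch
import open_clip

model_name = "EVA02-B-16"          # assumed identifier for EVA02_CLIP_B_psz16_s8B
pretrained = "merged2b_s8b_b131k"  # assumed pretrained tag; check before use

model, _, preprocess = open_clip.create_model_and_transforms(
    model_name, pretrained=pretrained
)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()
```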
## Implementation Details
The EVA-CLIP series is trained in different precision formats (fp16 and bf16) on large-scale datasets including LAION-400M, LAION-2B, and a custom Merged-2B dataset. Training used substantial compute, ranging from 64 to 256 A100 GPUs depending on the model variant.
- Multiple architecture variants available (Base, Large, and Enormous)
- Training batch sizes ranging from 41K to 144K
- Advanced model interpolation techniques for patch embedding and position embedding (see the sketch below)
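The interpolation bullet above refers to the common ViT technique of resampling a learned 2D position-embedding grid when the input resolution, and therefore the patch grid, changes. The sketch below is a generic, hedged illustration of that idea rather than the EVA-CLIP implementation; the function name and tensor layout are assumptions.

```python
# Hedged sketch of 2D position-embedding interpolation: resample the learned
# position grid so a ViT trained at one resolution can run at another.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, dim), with a leading [CLS] position."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, H, W) so bicubic image resampling can be applied
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. adapting a 14x14 grid (224px input, 16px patches) to a 16x16 grid (256px input)
new_pe = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid=16)
```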
## Core Capabilities
- State-of-the-art zero-shot classification performance on ImageNet (up to 82.0% top-1); see the worked example after this list
- Superior MSCOCO Text-to-Image retrieval (up to 75.0% R@5)
- Scalable architecture supporting various model sizes for different requirements
- Efficient training through MIM teacher-student framework
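The zero-shot classification capability listed above follows the standard CLIP recipe: encode the image and a set of text prompts, normalize both embeddings, and take a softmax over cosine similarities. The snippet below is a hedged, self-contained sketch; the model and pretrained names, prompts, and image path are placeholders.

```python
# Zero-shot classification sketch in the standard CLIP style.
# Model name, pretrained tag, prompts, and image path are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-B-16", pretrained="merged2b_s8b_b131k"  # assumed identifiers
)
tokenizer = open_clip.get_tokenizer("EVA02-B-16")
model.eval()

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(prompts)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)

print({p: round(float(s), 4) for p, s in zip(prompts, probs[0])})
```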
## Frequently Asked Questions
### Q: What makes this model unique?
The EVA-CLIP series provides the most performant open-source CLIP models at every scale, excelling in particular at zero-shot classification on mainstream benchmarks such as ImageNet and its variants.
### Q: What are the recommended use cases?
The models are particularly well suited to zero-shot image classification, text-to-image retrieval, and general vision-language tasks. The range of model sizes supports deployment in scenarios from resource-constrained environments to high-performance settings.
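As an illustration of the retrieval use case, the hedged sketch below ranks a handful of candidate images against a single text query by cosine similarity; the model identifiers and file paths are placeholders. In a real system the image embeddings would be precomputed and indexed rather than encoded per query.

```python
# Text-to-image retrieval sketch: rank candidate images for one text query.
# Model/pretrained names and file paths are illustrative placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-B-16", pretrained="merged2b_s8b_b131k"  # assumed identifiers
)
tokenizer = open_clip.get_tokenizer("EVA02-B-16")
model.eval()

query = tokenizer(["a person riding a bicycle"])
paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]  # placeholder candidates
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (txt_feats @ img_feats.T).squeeze(0)

# Print candidates from best to worst match
for rank, i in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(rank, paths[i], round(float(scores[i]), 4))
```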