EVA-CLIP

Maintained By: QuanSun

Property   Value
License    MIT
Paper      arXiv:2303.15389
Author     QuanSun

What is EVA-CLIP?

EVA-CLIP is a series of state-of-the-art vision-language models that achieves exceptional performance in zero-shot classification tasks. The model family includes various sizes, from the efficient EVA02_CLIP_B_psz16_s8B (149M parameters) to the powerful EVA02_CLIP_E_psz14_plus_s9B (5.0B parameters).
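
As a rough usage sketch, EVA-CLIP checkpoints can typically be loaded through the open_clip library and used for zero-shot classification. The model name "EVA02-B-16", the pretrained tag "merged2b_s8b_b131k", and the image path below are illustrative assumptions, not guaranteed identifiers; check `open_clip.list_pretrained()` for what your installation actually provides (the EVA backbones may also require timm).

```python
import torch
import open_clip
from PIL import Image

# Assumed model/pretrained tags; verify with open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-B-16", pretrained="merged2b_s8b_b131k"
)
tokenizer = open_clip.get_tokenizer("EVA02-B-16")
model.eval()

# "cat.jpg" is a placeholder image path.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog", "a photo of a car"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is cosine similarity, then softmax over labels.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # highest probability should land on the matching caption
```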

Implementation Details

The EVA-CLIP series is trained using different precision formats (fp16 and bf16) on massive datasets including LAION-400M, LAION-2B, and a custom Merged-2B dataset. Training utilized extensive computational resources, ranging from 64 to 256 A100 GPUs depending on the model variant.

  • Multiple architecture variants available (Base, Large, and Enormous)
  • Training batch sizes ranging from 41K to 144K
  • Advanced model interpolation techniques for patch embedding and position embedding (see the position-embedding sketch after this list)
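
The sketch below illustrates the position-embedding interpolation trick in generic PyTorch form: the class-token embedding is kept as-is and the 2D grid of patch-position embeddings is resampled bicubically to the new grid size. This is a simplified illustration of the general technique used when loading a checkpoint at a different resolution or patch size, not the exact routine from the EVA-CLIP codebase.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid_size: int) -> torch.Tensor:
    """Resize a ViT position embedding of shape [1, 1 + old_grid**2, dim] to a new grid."""
    cls_token, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid_size = int(patch_pos.shape[1] ** 0.5)

    # [1, N, dim] -> [1, dim, old, old] so 2D interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid_size, old_grid_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(
        patch_pos, size=(new_grid_size, new_grid_size), mode="bicubic", align_corners=False
    )
    # Back to [1, new_grid**2, dim] and re-attach the class token.
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size * new_grid_size, dim)
    return torch.cat([cls_token, patch_pos], dim=1)

# Example: 14x14 grid (224px / 16px patches) -> 16x16 grid (256px / 16px patches).
old = torch.randn(1, 1 + 14 * 14, 768)
new = interpolate_pos_embed(old, new_grid_size=16)
print(new.shape)  # torch.Size([1, 257, 768])
```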

Core Capabilities

  • State-of-the-art zero-shot classification performance on ImageNet (up to 82.0% top-1)
  • Superior MSCOCO Text-to-Image retrieval (up to 75.0% R@5; see the retrieval sketch after this list)
  • Scalable architecture supporting various model sizes for different requirements
  • Efficient training through MIM teacher-student framework
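
For reference, the toy sketch below shows how text-to-image Recall@K is computed from CLIP-style features. It assumes one caption per image and random tensors standing in for the outputs of `encode_text` / `encode_image`; a real MSCOCO evaluation handles multiple captions per image and the full test split.

```python
import torch

def recall_at_k(text_feats: torch.Tensor, image_feats: torch.Tensor, k: int = 5) -> float:
    """Text-to-image Recall@K, assuming row i of text_feats pairs with row i of image_feats."""
    # L2-normalize so the dot product is cosine similarity.
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    sims = text_feats @ image_feats.T                 # [num_texts, num_images]
    topk = sims.topk(k, dim=-1).indices               # K best image indices per text query
    targets = torch.arange(sims.shape[0]).unsqueeze(1)
    hits = (topk == targets).any(dim=-1)              # did the paired image make the top K?
    return hits.float().mean().item()

# Toy usage with random features in place of real model outputs.
texts, images = torch.randn(100, 512), torch.randn(100, 512)
print(f"T2I R@5: {recall_at_k(texts, images, k=5):.3f}")
```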

Frequently Asked Questions

Q: What makes this model unique?

EVA-CLIP comprises the most performant open-source CLIP models across all scales, excelling in particular at zero-shot classification on mainstream benchmarks such as ImageNet and its variants.

Q: What are the recommended use cases?

The model is particularly well-suited for zero-shot image classification, text-to-image retrieval, and general vision-language tasks. Different model sizes allow for deployment in various scenarios, from resource-constrained environments to high-performance requirements.
