CLIP-ViT-B-16-DataComp.XL-s13B-b90K
| Property | Value |
|---|---|
| License | MIT |
| Training Data | DataComp-1B (1.4B samples) |
| ImageNet Accuracy | 73.5% (zero-shot top-1) |
| Research Paper | DataComp Paper |
What is CLIP-ViT-B-16-DataComp.XL-s13B-b90K?
This is a CLIP Vision Transformer (ViT-B/16) model trained on the large-scale DataComp-1B dataset using the OpenCLIP framework. Developed by LAION and trained on Stability.ai's compute infrastructure, it delivers strong zero-shot image classification performance.
Implementation Details
The model uses the Vision Transformer architecture with a 16x16 patch size and was trained on a curated dataset of 1.4 billion image-text samples. It is built with the OpenCLIP framework, which enables efficient training of large-scale vision-language models.
- Achieves 73.5% zero-shot top-1 accuracy on ImageNet-1k
- Evaluated on 38 different datasets
- Built for research and experimental purposes
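As a minimal sketch of how a checkpoint like this is typically loaded through OpenCLIP (the `hf-hub:` path below is inferred from the model name and should be verified against the actual Hugging Face repository):

```python
# Minimal sketch: loading the checkpoint through the OpenCLIP API.
# The hf-hub path is assumed from the model name; confirm it before use.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K"
)
model.eval()  # inference only, in line with the card's research-use framing
```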
Core Capabilities
- Zero-shot image classification (see the sketch after this list)
- Image and text retrieval
- Transfer learning for downstream tasks
- Image generation guidance and conditioning
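The following sketch illustrates the zero-shot classification capability, assuming the `model`, `preprocess`, and `tokenizer` objects from the loading snippet above; the image path, class names, and prompt phrasing are illustrative placeholders, not part of the model card:

```python
# Zero-shot classification sketch: rank candidate text prompts against an image.
import torch
from PIL import Image

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]  # placeholders
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score with scaled cosine similarity and softmax over labels
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```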
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its training on the large DataComp-1B dataset and its strong zero-shot classification performance, reaching 73.5% top-1 accuracy on ImageNet-1k without any task-specific fine-tuning.
Q: What are the recommended use cases?
The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and transfer learning to downstream tasks. It is not recommended for deployed commercial applications without thorough in-domain testing.
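For the transfer-learning use case, a common pattern is to treat the frozen image encoder as a feature extractor and fit a lightweight probe on top. The sketch below assumes the `model` object from the loading snippet; the scikit-learn probe and the random stand-in tensors are illustrative assumptions, not part of the model card:

```python
# Hedged transfer-learning sketch: frozen image features + a linear probe.
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    """Encode preprocessed image tensors with the frozen CLIP image encoder."""
    with torch.no_grad():
        feats = model.encode_image(images)
        feats /= feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

# Toy stand-in data so the sketch runs end to end; replace with real
# preprocessed images and integer labels from a downstream dataset.
train_images, train_labels = torch.randn(16, 3, 224, 224), [0, 1] * 8
test_images, test_labels = torch.randn(8, 3, 224, 224), [0, 1] * 4

probe = LogisticRegression(max_iter=1000)
probe.fit(extract_features(train_images), train_labels)
print(f"linear-probe accuracy: {probe.score(extract_features(test_images), test_labels):.3f}")
```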