CLIP-ViT-B-16-DataComp.XL-s13B-b90K

Maintained By
laion

  • License: MIT
  • Training Data: DataComp-1B (1.4B samples)
  • ImageNet Accuracy: 73.5% (zero-shot top-1)
  • Research Paper: DataComp Paper

What is CLIP-ViT-B-16-DataComp.XL-s13B-b90K?

This is a CLIP Vision Transformer (ViT-B/16) model trained on the DataComp-1B dataset with the OpenCLIP framework. Developed by LAION and trained on stability.ai's compute infrastructure, it delivers strong zero-shot image classification performance.

Implementation Details

The model uses the Vision Transformer architecture with a 16x16 patch size and was trained on a curated dataset of 1.4 billion samples. It is built with the OpenCLIP framework, which enables efficient training of large-scale vision-language models; the suffix in the model name indicates roughly 13 billion samples seen during training (s13B) at a global batch size of 90K (b90K). A minimal usage sketch follows the list below.

  • Achieves 73.5% zero-shot top-1 accuracy on ImageNet-1k
  • Evaluated on 38 different datasets
  • Built for research and experimental purposes
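
As a concrete reference, here is a minimal sketch of zero-shot classification with the OpenCLIP library, loading the checkpoint through its Hugging Face Hub integration. The image path and prompt strings are placeholders, and the snippet is an illustration rather than an official example.

```python
import torch
from PIL import Image
import open_clip

# Load the checkpoint from the Hugging Face Hub via OpenCLIP
model_id = "hf-hub:laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K"
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image, shape 1x3x224x224
texts = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder class prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so the dot product becomes a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability per prompt for the input image
```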

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval (see the retrieval sketch after this list)
  • Transfer learning for downstream tasks
  • Image generation guidance and conditioning
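
The retrieval capability reduces to ranking cosine similarities between image and text embeddings. The sketch below reuses the `model`, `preprocess`, and `tokenizer` objects from the previous snippet; the gallery paths and query text are placeholders.

```python
import torch
from PIL import Image

# Placeholder gallery of images to search over
gallery_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
gallery = torch.stack([preprocess(Image.open(p)) for p in gallery_paths])

with torch.no_grad():
    image_features = model.encode_image(gallery)
    query = model.encode_text(tokenizer(["a dog playing in the snow"]))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    query = query / query.norm(dim=-1, keepdim=True)
    scores = (query @ image_features.T).squeeze(0)  # cosine similarity per gallery image

# Rank gallery images by similarity to the text query
for idx in scores.argsort(descending=True):
    i = int(idx)
    print(gallery_paths[i], float(scores[i]))
```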

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its training on the massive DataComp-1B dataset and its strong zero-shot classification performance: 73.5% top-1 accuracy on ImageNet-1k without any task-specific fine-tuning.
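
The mechanism behind this is that the classifier head is built from text embeddings of the class names, so no task-specific training is needed. Below is a sketch of that construction, reusing `model` and `tokenizer` from the earlier snippet; the class names and prompt templates are illustrative.

```python
import torch

class_names = ["golden retriever", "tabby cat", "school bus"]  # illustrative classes
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

with torch.no_grad():
    weights = []
    for name in class_names:
        prompts = tokenizer([t.format(name) for t in templates])  # all templates for this class
        emb = model.encode_text(prompts)                          # num_templates x embed_dim
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)                                # average over templates
        weights.append(mean_emb / mean_emb.norm())
    zero_shot_head = torch.stack(weights, dim=1)                  # embed_dim x num_classes

# Normalized image features are then scored with a single matrix product:
# logits = 100.0 * image_features @ zero_shot_head
```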

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot classification, image-text retrieval, and transfer learning to downstream tasks. It is not recommended for deployed or commercial applications without thorough, task-specific evaluation.
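
For the transfer-learning use case, a common pattern is a linear probe: freeze the image encoder, extract features, and fit a lightweight classifier on top. The sketch below assumes `train_images`/`train_labels` and `test_images`/`test_labels` already exist as lists of PIL images and labels, and reuses `model` and `preprocess` from the first snippet; the classifier choice is illustrative.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(pil_images, batch_size=32):
    """Encode PIL images with the frozen CLIP image encoder."""
    feats = []
    with torch.no_grad():
        for i in range(0, len(pil_images), batch_size):
            batch = torch.stack([preprocess(im) for im in pil_images[i:i + batch_size]])
            f = model.encode_image(batch)
            feats.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy())
    return np.concatenate(feats)

# Assumed to exist: train_images, train_labels, test_images, test_labels
X_train = extract_features(train_images)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = extract_features(test_images)
print("linear-probe accuracy:", clf.score(X_test, test_labels))
```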
