CLIP-ViT-B-32-DataComp.XL-s13B-b90K

Maintained By
laion

  • License: MIT
  • Framework: OpenCLIP
  • Paper: DataComp Paper
  • Training Data: DataComp-1B (1.4B samples)

What is CLIP-ViT-B-32-DataComp.XL-s13B-b90K?

This is a Vision Transformer model trained with the CLIP contrastive objective on the DataComp-1B dataset. The model uses a ViT-B/32 architecture and was trained on the stability.ai cluster. It achieves 72.7% zero-shot top-1 accuracy on ImageNet-1k, making it well suited to zero-shot image classification.

Implementation Details

The model is built on the OpenCLIP framework and trained on DataComp-1B, a curated dataset of 1.4 billion image-text pairs; the "s13B-b90K" suffix in the model name indicates roughly 13B training samples seen at a global batch size of about 90K. It is designed for efficient processing of image-text pairs and shows strong performance across the 38 evaluation datasets in the DataComp benchmark suite. A minimal loading sketch follows the list below.

  • Architecture: Vision Transformer Base model with 32x32 patches
  • Training Infrastructure: stability.ai cluster
  • Dataset: DataComp-1B with comprehensive evaluation suite
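The checkpoint can be loaded through OpenCLIP's Hugging Face Hub integration. The sketch below is a minimal example, assuming the open_clip_torch and torch packages are installed and the checkpoint is published on the Hub as laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K.

```python
import open_clip

MODEL_ID = "hf-hub:laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K"

# create_model_and_transforms returns (model, train_preprocess, eval_preprocess);
# the eval preprocess is what you want for inference.
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)

model.eval()  # inference mode for zero-shot use
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```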

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Foundation for downstream task fine-tuning
  • Image generation guidance and conditioning
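As an illustration of the first two capabilities above, the following sketch performs zero-shot classification by comparing a normalized image embedding against embeddings of prompted class names. The image path and candidate labels are placeholders; the model is loaded as in the previous example.

```python
import torch
from PIL import Image
import open_clip

MODEL_ID = "hf-hub:laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K"
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Placeholder image and candidate labels.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product becomes cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```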

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its training on the carefully curated DataComp-1B dataset and its strong zero-shot performance: 72.7% top-1 accuracy on ImageNet-1k without any fine-tuning.

Q: What are the recommended use cases?

The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and as a foundation for fine-tuning on downstream tasks. Deployed use cases, commercial or otherwise, are currently out of scope.
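
For image-text retrieval, the same encoders can rank a gallery of images against a text query. The sketch below uses placeholder file names and a placeholder query, with the model loaded as in the earlier examples.

```python
import torch
from PIL import Image
import open_clip

MODEL_ID = "hf-hub:laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K"
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]  # placeholder gallery
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
query = tokenizer(["a dog playing in the snow"])  # placeholder query

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)  # cosine similarity per image

# Rank gallery images by similarity to the query.
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score={scores[idx]:.3f})")
```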
