CLIP-ViT-B-32-DataComp.XL-s13B-b90K


by laion

CLIP Vision Transformer model trained on DataComp-1B dataset, achieving 72.7% ImageNet accuracy. Optimized for zero-shot classification and retrieval tasks.

License: MIT
Framework: OpenCLIP
Paper: DataComp Paper
Training Data: DataComp-1B (1.4B samples)

What is CLIP-ViT-B-32-DataComp.XL-s13B-b90K?

This is a Vision Transformer model trained with the CLIP objective on the large-scale DataComp-1B dataset. The model uses a ViT-B/32 architecture and was trained on the stability.ai cluster. It achieves 72.7% zero-shot top-1 accuracy on ImageNet-1k, making it well suited to zero-shot image classification.

Implementation Details

The model is built on the OpenCLIP framework and was trained on DataComp-1B, a dataset of 1.4 billion image-text pairs. It is designed for efficient processing of image-text pairs and shows strong performance across a 38-dataset evaluation suite; a minimal loading sketch follows the list below.

  • Architecture: Vision Transformer Base model with 32x32 patches
  • Training Infrastructure: stability.ai cluster
  • Dataset: DataComp-1B with comprehensive evaluation suite
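
A minimal loading sketch in Python, assuming the open_clip_torch package is installed and that the checkpoint is pulled from the Hugging Face Hub under the identifier laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K via OpenCLIP's hf-hub: prefix:

```python
# Loading sketch (assumes open_clip_torch is installed and the checkpoint
# is available on the Hugging Face Hub under the identifier below).
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K"
)
tokenizer = open_clip.get_tokenizer("hf-hub:laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K")
model.eval()  # inference only; no fine-tuning in this sketch
```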

Core Capabilities

  • Zero-shot image classification (see the sketch after this list)
  • Image and text retrieval
  • Foundation for downstream task fine-tuning
  • Image generation guidance and conditioning
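
A zero-shot classification sketch, reusing the model, preprocess, and tokenizer objects from the loading example above; the image path and candidate captions are placeholders:

```python
# Zero-shot classification sketch: score one image against candidate captions.
import torch
from PIL import Image

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(["a photo of a dog", "a photo of a cat", "a photo of a car"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings, then turn scaled cosine similarities into probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per candidate caption
```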

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its training on the carefully curated DataComp-1B dataset and its impressive zero-shot classification capabilities, achieving 72.7% accuracy on ImageNet-1k without any fine-tuning.

Q: What are the recommended use cases?

The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and as a foundation for downstream fine-tuning. Note that deployed use cases, commercial or otherwise, are currently considered out of scope by the authors.
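
For illustration, a small text-to-image retrieval sketch under the same assumptions as the earlier examples: embed one query caption and a handful of gallery images, then rank the images by cosine similarity. File names are placeholders.

```python
# Text-to-image retrieval sketch: rank a small gallery of images against a query caption.
import torch
from PIL import Image

gallery_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]  # placeholder files
images = torch.stack([preprocess(Image.open(p)) for p in gallery_paths])
query = tokenizer(["a dog playing in the snow"])

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)
    image_emb /= image_emb.norm(dim=-1, keepdim=True)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(0)  # one score per gallery image

ranking = scores.argsort(descending=True).tolist()
print([gallery_paths[i] for i in ranking])  # gallery paths, best match first
```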
