CLIP-ViT-B-16-DataComp.XL-s13B-b90K

By LAION

CLIP ViT-B/16 model trained on the DataComp-1B dataset, reaching 73.5% zero-shot top-1 accuracy on ImageNet-1k. Suited to zero-shot image classification and image-text retrieval.

| Property | Value |
|---|---|
| License | MIT |
| Training Data | DataComp-1B (1.4B samples) |
| ImageNet Accuracy | 73.5% (zero-shot top-1) |
| Research Paper | DataComp paper |

What is CLIP-ViT-B-16-DataComp.XL-s13B-b90K?

This is a CLIP Vision Transformer (ViT-B/16) model trained on the large-scale DataComp-1B dataset using the OpenCLIP framework. Developed by LAION and trained on stability.ai's infrastructure, it demonstrates how careful dataset curation can substantially improve zero-shot image classification.

Implementation Details

The model uses the Vision Transformer architecture with a 16x16 patch size and was trained on a curated dataset of 1.4 billion samples; the s13B-b90K suffix in the model name denotes roughly 13 billion samples seen during training at a global batch size of 90k. It is built with the OpenCLIP framework, which enables efficient training of large-scale vision-language models; a loading sketch follows the list below.

  • Achieves 73.5% zero-shot top-1 accuracy on ImageNet-1k
  • Evaluated on 38 different datasets
  • Built for research and experimental purposes
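
As a minimal sketch of loading the checkpoint through OpenCLIP's Hugging Face hub integration (assuming the open_clip_torch package is installed; the hub path follows LAION's published model name):

```python
import open_clip

# Load the model and its image preprocessing transform from the Hugging Face hub.
# The "hf-hub:" prefix tells OpenCLIP to fetch OpenCLIP-format weights from the hub.
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K"
)
model.eval()  # inference mode
```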

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Transfer learning for downstream tasks
  • Image generation guidance and conditioning
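
To illustrate the first capability, here is a minimal zero-shot classification sketch, continuing from the loading snippet above ("cat.jpg" and the candidate labels are placeholders):

```python
import torch
from PIL import Image

# Candidate classes are expressed as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # placeholder image path
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled similarities -> probabilities over the candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```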

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its training on the massive, carefully filtered DataComp-1B dataset and its strong zero-shot performance: 73.5% top-1 accuracy on ImageNet-1k without any task-specific fine-tuning.

Q: What are the recommended use cases?

The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and as a foundation for transfer learning (see the linear-probe sketch below). It is not recommended for deployed commercial applications without thorough, task-specific testing.
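
For the transfer-learning use case, one common pattern is a linear probe: freeze the image encoder and fit a lightweight classifier on its embeddings. A sketch under the assumption that train_images/test_images are lists of PIL images with integer train_labels/test_labels (all hypothetical stand-ins for a downstream dataset), reusing the model loaded above:

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def embed(pil_images):
    # Frozen CLIP image embeddings used as fixed features.
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = model.encode_image(batch)
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

# train_images, train_labels, test_images, test_labels are hypothetical.
clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_images), train_labels)
print("probe accuracy:", clf.score(embed(test_images), test_labels))
```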
