CLIP-convnext_base_w-laion2B-s13B-b82K
| Property | Value |
|---|---|
| License | MIT |
| Training Dataset | LAION-2B |
| Resolution | 256x256 |
| Zero-Shot ImageNet Accuracy | 70.8% |
| Paper | ConvNeXt Paper |
What is CLIP-convnext_base_w-laion2B-s13B-b82K?
This model is a CLIP variant that uses ConvNeXt-Base as its image backbone instead of the more common ViT or ResNet towers. It was trained on the LAION-2B dataset for roughly 13B samples seen and reaches 70.8% zero-shot top-1 accuracy on ImageNet, showing better sample efficiency than comparable ViT-B/16 CLIP models.
Implementation Details
The model combines a ConvNeXt-Base image tower with a text tower equivalent to the RN50x4 model (depth 12, embed dim 640) from OpenAI CLIP. Training used a global batch size of 81920, random resize crop augmentation, and amp_bfloat16 mixed precision; a loading sketch follows the list below.
- Training Resolution: 256x256
- Batch Size: 81920
- Learning Rate: 1e-3
- Training Samples: ~13B
- Architecture: ConvNeXt-Base image tower with the wider 640-dim embedding (the "_w" variant)
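As a rough sketch (not part of the original model card), the checkpoint can be loaded through the open_clip library. The model name `convnext_base_w` and pretrained tag `laion2b_s13b_b82k` follow OpenCLIP's naming conventions and are assumptions here; the returned validation transform reflects the 256x256 training resolution.

```python
# Sketch, not from the model card: loading this checkpoint via the open_clip
# library. The model name "convnext_base_w" and pretrained tag
# "laion2b_s13b_b82k" follow OpenCLIP's conventions and are assumed here.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)
tokenizer = open_clip.get_tokenizer("convnext_base_w")
model.eval()

# The validation transform should resize/crop inputs to the 256x256 training resolution.
print(preprocess)
```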
Core Capabilities
- Zero-shot image classification (see the sketch after this list)
- Image and text retrieval
- Efficient scaling with model size and image resolution
- Robust performance across varying resolutions
- Compatible with downstream task fine-tuning
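The zero-shot classification capability can be exercised end to end in a few lines. The sketch below is illustrative rather than official: the image path, label set, and prompt template are placeholders, and the open_clip model and pretrained tag names are the same assumptions as in the loading example above.

```python
# Illustrative zero-shot classification sketch; the file path, labels, and
# prompt template are placeholders, and the open_clip names are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)
tokenizer = open_clip.get_tokenizer("convnext_base_w")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # 1 x 3 x 256 x 256
labels = ["a dog", "a cat", "a car"]
text = tokenizer([f"a photo of {label}" for label in labels])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

The same encode-and-compare pattern covers image-text retrieval: embed a gallery of images or captions once, then rank them by cosine similarity against the query embedding.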
Frequently Asked Questions
Q: What makes this model unique?
This is the first known ConvNeXt CLIP model trained at scale, achieving comparable performance to CLIP ViT-B/16 and RN50x4 models while potentially being more sample efficient.
Q: What are the recommended use cases?
The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and serving as a foundation for downstream fine-tuning. It is not recommended for deployed commercial applications without thorough, use-case-specific testing.