CLIP-convnext_base_w-laion2B-s13B-b82K

Maintained By
laion

  • License: MIT
  • Training Dataset: LAION-2B
  • Resolution: 256x256
  • Zero-Shot ImageNet Accuracy: 70.8%
  • Paper: ConvNeXt Paper

What is CLIP-convnext_base_w-laion2B-s13B-b82K?

This model pairs CLIP training with a ConvNeXt-Base image tower, offering an efficient alternative to the usual ViT and ResNet backbones. It was trained on the LAION-2B dataset for roughly 13B samples seen and reaches 70.8% top-1 zero-shot accuracy on ImageNet, demonstrating better sample efficiency than CLIP ViT-B/16 at a comparable training scale.

Implementation Details

The model combines a ConvNeXt-Base image tower with a text tower matching the RN50x4 text encoder from OpenAI CLIP (depth 12, embedding dimension 640). Training used a global batch size of 81920 with Random Resize Crop augmentation and amp_bfloat16 mixed precision; a minimal loading sketch follows the list below.

  • Training Resolution: 256x256
  • Batch Size: 81920
  • Learning Rate: 1e-3
  • Training Samples: ~13B
  • Architecture: ConvNeXt-Base with wide embed dim

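In practice, weights like these are typically loaded through the open_clip library. The sketch below is a starting point, not official usage: it assumes the checkpoint is published on the Hugging Face Hub as laion/CLIP-convnext_base_w-laion2B-s13B-b82K and that your open_clip version supports hf-hub: model references.

```python
# Minimal loading sketch (assumes open_clip with hf-hub support and that the
# weights are hosted at laion/CLIP-convnext_base_w-laion2B-s13B-b82K).
import torch
import open_clip

MODEL_REF = "hf-hub:laion/CLIP-convnext_base_w-laion2B-s13B-b82K"

# Returns the model plus training/evaluation image preprocessing transforms.
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_REF)
tokenizer = open_clip.get_tokenizer(MODEL_REF)
model.eval()  # inference only; no gradient updates needed for zero-shot use
```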
Core Capabilities

  • Zero-shot image classification (see the sketch after this list)
  • Image and text retrieval
  • Efficient scaling with model size and image resolution
  • Robust performance across varying resolutions
  • Compatible with downstream task fine-tuning

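As a concrete illustration of the zero-shot use case, the sketch below scores one image against a few candidate captions. It reuses the model, preprocess, and tokenizer objects from the loading sketch above; the image path and label prompts are placeholders.

```python
# Zero-shot classification sketch; reuses model/preprocess/tokenizer from above.
# "example.jpg" and the label prompts are illustrative placeholders.
from PIL import Image
import torch

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # [1, 3, 256, 256]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```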
Frequently Asked Questions

Q: What makes this model unique?

This is the first known ConvNeXt CLIP model trained at scale, achieving comparable performance to CLIP ViT-B/16 and RN50x4 models while potentially being more sample efficient.

Q: What are the recommended use cases?

The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and as a starting point for downstream fine-tuning. It is not recommended for deployed or commercial applications without thorough, domain-specific testing.
