CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg

Maintained By: laion

Property                    | Value
--------------------------- | --------
Total Parameters            | 1.2B
License                     | MIT
Training Data               | LAION-2B
Image Resolution            | 256x256
Zero-shot ImageNet Accuracy | 79.1%

What is CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg?

This CLIP model uses the ConvNeXt-XXLarge architecture as its image tower, making it the largest pretrained ConvNeXt model released to date (847M parameters in the image tower, roughly 1.2B in total with the text tower). It was trained from scratch on the LAION-2B dataset, with no prior pretraining of the image tower, and achieves strong zero-shot classification performance.

Implementation Details

The model pairs the ConvNeXt-XXLarge image tower with a text tower equivalent in size to that of ViT-H-14 CLIP models. At 256x256 resolution it runs at 222 GMAC and 146 MActs, placing it between ViT-g-14 and ViT-G-14 CLIP models in scale while using compute comparatively efficiently.

  • Training utilized both float16 and bfloat16 precision
  • Implements advanced augmentation techniques including Random Resize Crop and Random Erasing
  • Trained across multiple high-performance computing clusters
  • Uses a global batch size of 81920

Core Capabilities

  • Zero-shot image classification with 79.1% accuracy on ImageNet
  • Image and text retrieval tasks
  • Suitable for downstream fine-tuning
  • Efficient scaling for larger image sizes
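Zero-shot classification with CLIP reduces to cosine similarity between the image embedding and each candidate caption embedding, scaled by a learned temperature and passed through a softmax. The sketch below illustrates that scoring step with toy 4-dimensional vectors standing in for the model's real, much higher-dimensional embeddings; the logit scale of 100 is an assumption matching CLIP's typical learned temperature.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before matching."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Convert raw similarity scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against candidate caption embeddings.

    CLIP multiplies cosine similarities by a learned temperature
    (logit_scale, around 100 after training) before the softmax.
    """
    img = l2_normalize(image_emb)
    sims = []
    for t in text_embs:
        t = l2_normalize(t)
        sims.append(logit_scale * sum(a * b for a, b in zip(img, t)))
    return softmax(sims)

# Toy embeddings; a real run would use the model's encoders instead.
image = [0.9, 0.1, 0.0, 0.2]
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1, 0.1],
    "a photo of a cat": [0.0, 0.9, 0.3, 0.0],
}
probs = zero_shot_probs(image, list(captions.values()))
best = max(zip(captions, probs), key=lambda kv: kv[1])
print(best[0])  # the caption most similar to the image
```

The same scoring also powers image-text retrieval: ranking a gallery of image embeddings against one text embedding (or vice versa) instead of taking a softmax over labels.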

Frequently Asked Questions

Q: What makes this model unique?

It is the first CLIP model with a non-ViT image tower to exceed 79% ImageNet top-1 zero-shot accuracy, and it is the largest pretrained ConvNeXt model released to date.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot classification, image-text retrieval, and as a foundation for downstream task fine-tuning. However, it's not recommended for deployed commercial applications without thorough testing.
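For research use, the checkpoint is typically loaded through the open_clip library via its Hugging Face Hub ID. The sketch below assumes the open_clip_torch, torch, and Pillow packages are installed; the imports are guarded so the file still runs without them, and the image path in the usage comment is hypothetical.

```python
# Sketch of zero-shot inference with open_clip (assumes open_clip_torch,
# torch, and Pillow; guarded so the script degrades gracefully without them).
try:
    import torch
    import open_clip
    from PIL import Image
    DEPS_AVAILABLE = True
except ImportError:
    DEPS_AVAILABLE = False

MODEL_ID = "hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg"

def classify(image_path, labels):
    """Return {label: probability} for one image over candidate labels."""
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
    tokenizer = open_clip.get_tokenizer(MODEL_ID)

    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([f"a photo of a {label}" for label in labels])

    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
        # Normalize, then temperature-scaled softmax over cosine similarities.
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
    return dict(zip(labels, probs[0].tolist()))

# Usage (downloads the ~1.2B-parameter checkpoint on first call):
# classify("example.jpg", ["dog", "cat", "bird"])  # hypothetical image path
```

Note that loading the model pulls the full checkpoint from the Hub, so the call is left commented out here.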
