CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg

Property	Value
Total Parameters	1.2B
License	MIT
Training Data	LAION-2B
Image Resolution	256x256
Zero-shot ImageNet Accuracy	79.1%

What is CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg?

This CLIP model uses ConvNeXt-XXLarge as its image tower. At 847M parameters, that tower is the largest released pretrained ConvNeXt model. Trained on the LAION-2B dataset, the model reaches 79.1% zero-shot top-1 accuracy on ImageNet with no previous pretraining of the image tower.

Implementation Details

The model combines a ConvNeXt-XXLarge image tower with a text tower equivalent in size to that of ViT-H-14 models. At 256x256 resolution, the combined model costs 222 GMACs and 146 MActs (millions of activations). In capability it sits between ViT-g-14 and ViT-G-14 while having a lower activation count than either.

Training used both float16 and bfloat16 precision
Uses increased augmentation and regularization (the "augreg" in the name), including Random Resized Crop and Random Erasing
Trained across multiple high-performance computing clusters
Uses a global batch size of 81920
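
The model name also encodes the training scale: in OpenCLIP-style names, "s34B" stands for roughly 34 billion samples seen and "b82K" for the global batch size (81920 here). The following is a rough sketch of the arithmetic these figures imply. It assumes the nominal 34B value is exact (the name only gives a rounded figure), treats the 1.2B and 847M parameter counts as the rounded values quoted above, and assumes the standard CLIP contrastive objective, in which every image in a batch is scored against every text.

```python
# Back-of-the-envelope numbers derived from figures quoted in this card.
# SAMPLES_SEEN is the nominal value from the "s34B" tag (an assumption: the
# exact count is not stated here); the other constants come from the text.
SAMPLES_SEEN = 34_000_000_000      # "s34B" in the model name (rounded)
GLOBAL_BATCH_SIZE = 81_920         # "b82K" in the model name
TOTAL_PARAMS = 1_200_000_000       # combined image + text model (rounded)
IMAGE_TOWER_PARAMS = 847_000_000   # ConvNeXt-XXLarge image tower


def optimizer_steps(samples_seen: int, batch_size: int) -> int:
    """Number of full batches needed to see `samples_seen` samples."""
    return samples_seen // batch_size


def pairs_per_step(batch_size: int) -> int:
    """Image-text similarity scores in one contrastive batch (N x N)."""
    return batch_size * batch_size


steps = optimizer_steps(SAMPLES_SEEN, GLOBAL_BATCH_SIZE)
remaining_params = TOTAL_PARAMS - IMAGE_TOWER_PARAMS

print(f"approx. optimizer steps: {steps:,}")
print(f"similarity scores per step: {pairs_per_step(GLOBAL_BATCH_SIZE):,}")
print(f"params outside the image tower (approx.): {remaining_params:,}")
```

The last figure is only as precise as the rounded inputs; it is an estimate of what the text tower and projection layers account for, not a reported number.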

Core Capabilities

Zero-shot image classification with 79.1% top-1 accuracy on ImageNet
Image and text retrieval tasks
Suitable for downstream fine-tuning
Efficient scaling for larger image sizes
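
To illustrate how zero-shot classification works with a CLIP-style model, here is a minimal sketch of the scoring step only. It assumes the image and the candidate text prompts have already been embedded by the model's two towers; the toy 3-dimensional vectors, the prompt strings, and the function names are hypothetical, and logit_scale=100.0 is an assumed value typical of CLIP-style models (the real value is a learned parameter).

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_probs(
    image_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # assumed; learned in a real CLIP model
) -> list[float]:
    """Softmax over scaled cosine similarities between one image and each prompt."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional embeddings standing in for real encoder outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = max(range(len(labels)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

In practice the embeddings would come from the model's image and text encoders (for example through the OpenCLIP library); that loading step is omitted here because it requires downloading the weights.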

Frequently Asked Questions

Q: What makes this model unique?

It is the first released CLIP model with a non-ViT image tower to exceed 79% zero-shot top-1 accuracy on ImageNet, and its image tower is the largest released pretrained ConvNeXt model to date.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot classification, image-text retrieval, and as a foundation for downstream task fine-tuning. However, it's not recommended for deployed commercial applications without thorough testing.
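
One way to make "thorough testing" concrete for retrieval is to measure recall@k on a held-out set of matched image-text pairs from the target domain. The sketch below is illustrative only: it assumes the embeddings are already computed and L2-normalized, that text i matches image i, and it uses hand-made 2-dimensional vectors in place of real model outputs. The function names are hypothetical.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(
    text_embs: Sequence[Sequence[float]],
    image_embs: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Fraction of text queries whose matching image (same index) ranks in the top k."""
    if not text_embs or len(text_embs) != len(image_embs):
        raise ValueError("expected a non-empty set with one image per text query")
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [dot(query, img) for img in image_embs]
        # Indices of images sorted from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy, hand-made unit vectors; pair i of text/image is the ground-truth match.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
text_embs = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

print("R@1:", recall_at_k(text_embs, image_embs, k=1))
print("R@2:", recall_at_k(text_embs, image_embs, k=2))
```

The same function can be run in the other direction (image queries against text candidates) by swapping the two arguments.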