CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup

Maintained By
laion


  • Model Size: 1.2B parameters
  • License: MIT
  • Training Data: LAION-2B
  • Zero-shot ImageNet Accuracy: 79.4%

What is CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup?

This CLIP model uses the ConvNeXt-XXLarge architecture as its image tower, making it the largest ConvNeXt model released to date. Trained from scratch, with no prior pretraining of the image tower, it surpasses 79% zero-shot top-1 accuracy on ImageNet.

Implementation Details

The model combines a ConvNeXt-XXLarge image tower (847M parameters) with a text tower equivalent in size to ViT-H-14 models. Training was conducted on the LAION-2B dataset at 256x256 resolution, utilizing a sophisticated training procedure with varying batch sizes and precision formats.

  • Global batch size: 81920-95744
  • Total parameters: 1.2B
  • Training samples: ~34B
  • Resolution: 256x256
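For inference, models in this family are typically loaded through the open_clip library using its Hugging Face Hub integration. The sketch below shows that pattern; the `hf-hub:` model id follows OpenCLIP's naming convention for this repository, but should be verified against the hub before use.

```python
# Sketch of loading this model via OpenCLIP's Hugging Face Hub support.
# The model id is assumed from the repository name; verify it on the hub.
MODEL_ID = "hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup"

def load_clip():
    """Build the model, the eval preprocessing transform, and the tokenizer.

    Requires `pip install open_clip_torch` and a multi-gigabyte weight
    download, so the import is kept local to this function.
    """
    import open_clip
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
    tokenizer = open_clip.get_tokenizer(MODEL_ID)
    model.eval()  # inference mode: disables dropout etc.
    return model, preprocess, tokenizer
```

The 256x256 input resolution is handled by the returned `preprocess` transform, so callers only need to pass in a PIL image.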

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Downstream task fine-tuning
  • Image generation guidance
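Zero-shot classification with a CLIP model works by embedding the image and one text prompt per candidate class into the same space, then taking a softmax over scaled image-text cosine similarities. A minimal sketch of that scoring step with toy, pre-computed embeddings (the vectors, prompts, and scale below are illustrative, not real model outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled image-text cosine similarities (CLIP's scoring rule)."""
    logits = [scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy embeddings: the image vector is closest to the first "prompt".
image = [0.9, 0.1, 0.0]
prompts = [[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
           [0.0, 1.0, 0.0],   # e.g. "a photo of a cat"
           [0.0, 0.0, 1.0]]   # e.g. "a photo of a car"
probs = zero_shot_probs(image, prompts)
```

In the real model the embeddings come from the image and text towers and the scale is a learned logit temperature; the scoring rule itself is unchanged.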

Frequently Asked Questions

Q: What makes this model unique?

It is the first CLIP model with a non-ViT image tower to exceed 79% zero-shot accuracy on ImageNet, and its ConvNeXt-XXLarge image tower is the largest ConvNeXt model released to date.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot classification, image-text retrieval, and fine-tuning for downstream tasks. However, it's not recommended for deployed commercial applications without thorough testing.
