CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup

by laion

A large-scale CLIP model using a ConvNeXt-XXLarge image tower, trained on the LAION-2B dataset and reaching 79.4% zero-shot accuracy on ImageNet.

| Property | Value |
| --- | --- |
| Model Size | 1.2B parameters |
| License | MIT |
| Training Data | LAION-2B |
| Zero-shot ImageNet Accuracy | 79.4% |

What is CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup?

This CLIP model uses the ConvNeXt-XXLarge architecture as its image tower and is the largest pretrained ConvNeXt model released to date. Trained from scratch, without a separately pretrained image tower, it surpasses 79% zero-shot classification accuracy on ImageNet.

Implementation Details

The model combines a ConvNeXt-XXLarge image tower (847M parameters) with a text tower equivalent in size to that of ViT-H-14 models. Training was conducted on the LAION-2B dataset at 256x256 resolution, in multiple stages with varying global batch sizes and numeric precision.

  • Global batch size: 81920-95744
  • Total parameters: 1.2B
  • Training samples: ~34B
  • Resolution: 256x256
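At inference time, a CLIP model classifies zero-shot by comparing the image tower's embedding against text embeddings of prompted class names. A minimal sketch of that scoring step, using random vectors in place of real tower outputs (the embedding dimension and logit scale here are illustrative assumptions, not values from this checkpoint):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Cosine-similarity scoring used for CLIP-style zero-shot classification.

    image_emb:   (d,) embedding from the image tower.
    text_embs:   (n_classes, d) embeddings of prompted class names.
    logit_scale: learned temperature (the value here is illustrative).
    """
    # L2-normalize so the dot product equals cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)
    # Softmax over candidate classes (shifted for numerical stability)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = zero_shot_probs(rng.normal(size=1024), rng.normal(size=(3, 1024)))
print(probs.shape, probs.sum())
```

The real pipeline produces `image_emb` and `text_embs` by running the preprocessed image and tokenized prompts through the two towers; the comparison step is exactly this normalized dot product.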

Core Capabilities

  • Zero-shot image classification
  • Image and text retrieval
  • Downstream task fine-tuning
  • Image generation guidance
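Image-text retrieval reduces to the same embedding-space comparison: rank every gallery item by cosine similarity to the query embedding. A hedged sketch, again with random embeddings standing in for real model outputs:

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=2):
    """Rank gallery items (images or captions) by cosine similarity to a query."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    gallery_embs = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = gallery_embs @ query_emb
    order = np.argsort(-sims)[:k]  # indices of the k best matches, best first
    return order, sims[order]

rng = np.random.default_rng(1)
idx, scores = retrieve_top_k(rng.normal(size=64), rng.normal(size=(5, 64)), k=2)
print(idx, scores)
```

The same function serves both retrieval directions: a text query against an image gallery, or an image query against a caption gallery.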

Frequently Asked Questions

Q: What makes this model unique?

It's the first CLIP model with a non-ViT image tower to exceed 79% ImageNet zero-shot accuracy, and the largest pretrained ConvNeXt model released to date.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, including zero-shot classification, image-text retrieval, and fine-tuning for downstream tasks. However, it's not recommended for deployed commercial applications without thorough testing.
