CLIP-convnext_base_w-laion2B-s13B-b82K
| Property | Value |
|---|---|
| License | MIT |
| Training Dataset | LAION-2B |
| Resolution | 256x256 |
| Zero-Shot ImageNet Accuracy | 70.8% |
| Paper | ConvNeXt Paper |
What is CLIP-convnext_base_w-laion2B-s13B-b82K?
This model is a CLIP variant that uses ConvNeXt-Base as its image backbone instead of the more common ViT or ResNet towers. It was trained on the LAION-2B dataset for roughly 13B samples seen and reaches 70.8% zero-shot top-1 accuracy on ImageNet, showing better sample efficiency than comparable ViT-B/16 CLIP models.
Implementation Details
The model combines a ConvNeXt-Base image tower with a text tower equivalent to the RN50x4 model (depth 12, embed dim 640) from OpenAI CLIP. Training used a global batch size of 81920, random resize crop augmentation, and amp_bfloat16 mixed precision; a loading sketch follows the list below.
- Training Resolution: 256x256
- Batch Size: 81920
- Learning Rate: 1e-3
- Training Samples: ~13B
- Architecture: ConvNeXt-Base image tower with the wider 640-dim embedding (the "_w" variant)
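As a rough sketch (not part of the original model card), the checkpoint can be loaded through the open_clip library. The model name `convnext_base_w` and pretrained tag `laion2b_s13b_b82k` follow OpenCLIP's naming conventions and are assumptions here; the returned validation transform reflects the 256x256 training resolution.

```python
# Sketch, not from the model card: loading this checkpoint via the open_clip
# library. The model name "convnext_base_w" and pretrained tag
# "laion2b_s13b_b82k" follow OpenCLIP's conventions and are assumed here.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)
tokenizer = open_clip.get_tokenizer("convnext_base_w")
model.eval()

# The validation transform should resize/crop inputs to the 256x256 training resolution.
print(preprocess)
```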
Core Capabilities
- Zero-shot image classification (see the sketch after this list)
- Image and text retrieval
- Efficient scaling with model size and image resolution
- Robust performance across varying resolutions
- Compatible with downstream task fine-tuning
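The zero-shot classification capability can be exercised end to end in a few lines. The sketch below is illustrative rather than official: the image path, label set, and prompt template are placeholders, and the open_clip model and pretrained tag names are the same assumptions as in the loading example above.

```python
# Illustrative zero-shot classification sketch; the file path, labels, and
# prompt template are placeholders, and the open_clip names are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k"
)
tokenizer = open_clip.get_tokenizer("convnext_base_w")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # 1 x 3 x 256 x 256
labels = ["a dog", "a cat", "a car"]
text = tokenizer([f"a photo of {label}" for label in labels])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

The same encode-and-compare pattern covers image-text retrieval: embed a gallery of images or captions once, then rank them by cosine similarity against the query embedding.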
Frequently Asked Questions
Q: What makes this model unique?
This is the first known ConvNeXt CLIP model trained at scale, achieving comparable performance to CLIP ViT-B/16 and RN50x4 models while potentially being more sample efficient.
Q: What are the recommended use cases?
The model is primarily intended for research use, including zero-shot classification, image-text retrieval, and serving as a foundation for downstream fine-tuning. It is not recommended for deployed commercial applications without thorough, use-case-specific testing.