CLIP-convnext_base_w-laion_aesthetic-s13B-b82K
| Property | Value |
|---|---|
| Training Data | LAION-Aesthetic (900M samples) |
| Architecture | ConvNeXt-Base with wide embedding dimension |
| Resolution | 256x256 |
| ImageNet Zero-Shot Accuracy | 71.0% |
| Model Source | Hugging Face |
What is CLIP-convnext_base_w-laion_aesthetic-s13B-b82K?
This model swaps CLIP's traditional ViT or ResNet image tower for a ConvNeXt-Base. Trained on an aesthetically filtered subset of LAION-2B, it delivers strong zero-shot image classification performance.
Implementation Details
The model employs a ConvNeXt-Base architecture with a wide embedding dimension, trained for 13B samples seen at a batch size of 81920. Augmentation uses Random Resized Crop with a scale range of (0.9, 1.0), and the model achieves state-of-the-art performance for models of its compute class.
- Trained on LAION-Aesthetic dataset (900M samples)
- Uses advanced augmentation techniques
- Optimized for 256x256 resolution
- Implements gradient checkpointing for efficient training
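Zero-shot classification with a CLIP model follows a standard recipe: embed the image and a set of text prompts, L2-normalize both, and take a temperature-scaled softmax over the cosine similarities. A minimal numpy sketch of that scoring step, using random vectors in place of real model outputs (the 640-dimensional embedding and the logit scale of 100 are assumptions typical of wide ConvNeXt CLIP configs, not values taken from this card):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against class-prompt embeddings, CLIP-style."""
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)  # temperature-scaled similarities
    logits -= logits.max()              # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Stand-ins for real embeddings (640-d is an assumed width)
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(640)
text_embs = rng.standard_normal((3, 640))  # e.g. prompts for 3 classes

probs = zero_shot_probs(image_emb, text_embs)
print(probs)
```

With the real model, `image_emb` and `text_embs` would come from the image and text towers; the scoring math is unchanged.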
Core Capabilities
- Zero-shot image classification with 71.0% accuracy on ImageNet
- Image and text retrieval tasks
- Support for cross-modal understanding
- Efficient scaling with model size and resolution
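The retrieval capability above reduces to the same similarity computation in both directions: given N image embeddings and M text embeddings, each row's argmax in the cosine-similarity matrix is the best-matching caption, and each column's argmax the best-matching image. A sketch with random vectors standing in for model embeddings, where each text is a noisy copy of its paired image so the matching is recoverable:

```python
import numpy as np

def retrieval_matches(image_embs, text_embs):
    """Cosine-similarity matrix plus top-1 matches in both directions."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = img @ txt.T  # shape: (num_images, num_texts)
    return sim, sim.argmax(axis=1), sim.argmax(axis=0)

rng = np.random.default_rng(1)
image_embs = rng.standard_normal((4, 640))
# Text j is a slightly perturbed copy of image j
text_embs = image_embs + 0.1 * rng.standard_normal((4, 640))

sim, img_to_txt, txt_to_img = retrieval_matches(image_embs, text_embs)
print(img_to_txt, txt_to_img)  # both recover the identity pairing
```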
Frequently Asked Questions
Q: What makes this model unique?
This is one of the first ConvNeXt CLIP models trained at scale. It offers an alternative to ViT and ResNet image towers while achieving better sample efficiency than comparable ViT-B/16 models.
Q: What are the recommended use cases?
The model is best suited for research use, including zero-shot classification, image-text retrieval, and as a foundation for fine-tuning on downstream tasks. It is not recommended for production deployment without thorough, task-specific testing.
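One lightweight form of the fine-tuning mentioned above is a linear probe: freeze the image tower and fit a linear classifier on its embeddings. A minimal ridge-regularized least-squares version in numpy, with synthetic clustered features standing in for frozen CLIP embeddings (a full probe would typically use logistic regression on actual model outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
num_classes, dim, n = 5, 640, 200

# Synthetic "frozen embeddings": one cluster center per class plus noise
centers = rng.standard_normal((num_classes, dim))
labels = rng.integers(0, num_classes, size=n)
feats = centers[labels] + 0.3 * rng.standard_normal((n, dim))

# One-hot targets; ridge-regularized least squares gives the probe weights
targets = np.eye(num_classes)[labels]
W = np.linalg.solve(feats.T @ feats + 1e-2 * np.eye(dim), feats.T @ targets)

preds = (feats @ W).argmax(axis=1)
accuracy = (preds == labels).mean()
print(accuracy)
```

Because the image tower stays frozen, only the small weight matrix `W` is learned, which makes linear probing a cheap first benchmark before full fine-tuning.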