# CLIP-GmP-ViT-L-14
| Property | Value |
|---|---|
| Parameter Count | 428M |
| License | MIT |
| Base Model | openai/clip-vit-large-patch14 |
| Tensor Type | F32 |
## What is CLIP-GmP-ViT-L-14?
CLIP-GmP-ViT-L-14 is a fine-tuned version of OpenAI's CLIP ViT-L/14 that uses Geometric Parametrization (GmP) to improve image classification performance. Notably, it reaches ~0.91 accuracy on ImageNet/ObjectNet, compared to ~0.84 for the original model.
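Because the architecture is unchanged, the model loads through the standard Transformers CLIP classes. The zero-shot classification sketch below assumes the checkpoint is published under the repo id `zer0int/CLIP-GmP-ViT-L-14` and uses a placeholder image path; both are illustrative.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed repo id; substitute the checkpoint you actually use.
model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # zero-shot class probabilities
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```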
## Implementation Details
The model is trained with Geometric Parametrization (GmP), which decomposes each weight vector into a radial component (its magnitude) and an angular component (its direction), so both can be learned while the vector's geometric structure is preserved; a minimal sketch of the idea follows the list below. The release includes multiple versions, including text encoder-only safetensors and full-model checkpoints.
- Implements Geometric Parametrization for improved performance
- Uses a custom loss function with label smoothing
- Maintains a modality gap of 0.80 (vs. 0.82 for the OpenAI pre-trained model)
- Available in multiple formats, including text encoder-only and full-model versions
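The following PyTorch sketch illustrates the decomposition described above: a linear layer whose weight rows are stored as a magnitude r and a direction theta, recombined as w = r * theta/||theta|| on each forward pass. The class and parameter names are illustrative, not taken from the author's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GmPLinear(nn.Module):
    """Linear layer with geometrically parametrized weights (illustrative).

    Each output neuron's weight vector w is stored as an unnormalized
    direction theta and a scalar magnitude r, and reconstructed on the
    fly as w = r * theta / ||theta||, so direction and length are
    separate learnable quantities.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        self.theta = nn.Parameter(w.clone())                # angular component
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))  # radial component
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.r * F.normalize(self.theta, dim=1)  # w = r * theta/||theta||
        return F.linear(x, weight, self.bias)

# Sanity check: output shape matches a standard nn.Linear.
layer = GmPLinear(768, 768)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```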
## Core Capabilities
- Superior text prompt following and detail generation
- Enhanced image classification accuracy
- Seamless integration with Hugging Face Transformers and Diffusers pipelines
- Works as a drop-in text encoder for text-to-image models such as Flux.1, SD3, and SDXL (see the sketch below)
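As one concrete example, the following sketch swaps the model in as the CLIP ViT-L/14 text encoder of a Diffusers pipeline. The SDXL base checkpoint and the repo id are assumptions for illustration; any pipeline whose text encoder is a CLIP ViT-L/14 text model can be replaced the same way.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPTextModel

# Assumed repo ids; substitute the checkpoints you actually use.
clip_id = "zer0int/CLIP-GmP-ViT-L-14"
sdxl_id = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline.from_pretrained(sdxl_id, torch_dtype=torch.float16)

# SDXL's first text encoder is a CLIP ViT-L/14 text model, so the
# fine-tuned text encoder can stand in for it directly.
pipe.text_encoder = CLIPTextModel.from_pretrained(clip_id, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe("a detailed photo of a red fox in the snow").images[0]
image.save("fox.png")
```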
## Frequently Asked Questions
Q: What makes this model unique?
A: The model's unique Geometric Parametrization approach and custom loss function with label smoothing enable significantly improved accuracy in image classification tasks while maintaining strong text-following capabilities.
Q: What are the recommended use cases?
A: The model is particularly well-suited for text-to-image generation, zero-shot image classification, and use as a replacement text encoder in various Stable Diffusion models. Different versions are optimized for specific use cases: the "TEXT" model excels in text-heavy scenarios, while the "SMOOTH" model may perform better in text-free applications.