Fashion-CLIP
| Property | Value |
|---|---|
| Parameter Count | 151M |
| License | MIT |
| Paper | Scientific Reports |
| Architecture | ViT-B/32 + Transformer |
What is Fashion-CLIP?
Fashion-CLIP is a specialized adaptation of the CLIP architecture, fine-tuned specifically for fashion-related tasks. Built upon the LAION CLIP checkpoint, it's trained on 800K fashion products from the Farfetch dataset to understand and represent fashion concepts in both visual and textual forms.
Implementation Details
The model uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. It is trained with a contrastive objective that maximizes the similarity of matched image-text pairs from fashion products (a sketch of this objective follows the list below).
- Utilizes white-background product images and detailed text descriptions
- Trained on concatenated product highlights and descriptions
- Achieves superior performance compared to base CLIP models on fashion tasks
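The contrastive objective referenced above is the standard CLIP-style symmetric loss over cosine similarities. The sketch below is a minimal illustration, not the actual Fashion-CLIP training code; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_embeds @ text_embeds.t() / temperature

    # Matched image-text pairs sit on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```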
Core Capabilities
- Zero-shot fashion product classification (see the sketch after this list)
- Cross-modal fashion concept understanding
- Product attribute detection and matching
- Improved performance on fashion-specific benchmarks (FMNIST: 0.83, KAGL: 0.73, DEEP: 0.62)
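As a rough illustration of the zero-shot classification capability listed above, the sketch below scores a product image against candidate label prompts using the Hugging Face transformers CLIP classes. The model id `patrickjohncyh/fashion-clip`, the image path, and the label strings are assumptions for illustration, not taken from this card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed Hub id for the Fashion-CLIP checkpoint.
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = Image.open("product.jpg")  # e.g. a white-background product shot
labels = ["a red dress", "a leather handbag", "a pair of sneakers"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```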
Frequently Asked Questions
Q: What makes this model unique?
Fashion-CLIP 2.0 demonstrates significant improvements over both the original CLIP and the first Fashion-CLIP release, particularly on fashion-specific zero-shot transfer tasks. It benefits from the LAION checkpoint's broader pre-training data while retaining specialized fashion understanding.
Q: What are the recommended use cases?
The model is ideal for e-commerce applications, product categorization, fashion recommendation systems, and zero-shot classification of fashion items. It performs best with standard product images on white backgrounds and detailed textual descriptions.
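For the e-commerce and recommendation scenarios mentioned above, a common pattern is to embed catalogue images once and rank them against free-text queries by cosine similarity. The sketch below illustrates this under assumptions: the model id, file names, and query string are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "patrickjohncyh/fashion-clip"  # assumed Hub id for the checkpoint
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Embed a small catalogue of product images once, offline (hypothetical files).
catalogue = [Image.open(p) for p in ["dress.jpg", "bag.jpg", "sneakers.jpg"]]
img_inputs = processor(images=catalogue, return_tensors="pt")
with torch.no_grad():
    img_embeds = F.normalize(model.get_image_features(**img_inputs), dim=-1)

# Embed a free-text query and rank products by cosine similarity.
txt_inputs = processor(text=["floral summer dress"], return_tensors="pt", padding=True)
with torch.no_grad():
    txt_embed = F.normalize(model.get_text_features(**txt_inputs), dim=-1)

scores = (txt_embed @ img_embeds.t()).squeeze(0)
print(scores.argsort(descending=True))  # indices of best-matching products
```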