Fashion-CLIP
| Property | Value |
|---|---|
| Parameter Count | 151M |
| License | MIT |
| Paper | Scientific Reports |
| Architecture | ViT-B/32 + Transformer |
What is Fashion-CLIP?
Fashion-CLIP is a specialized adaptation of the CLIP architecture, fine-tuned specifically for fashion-related tasks. Built upon the LAION CLIP checkpoint, it's trained on 800K fashion products from the Farfetch dataset to understand and represent fashion concepts in both visual and textual forms.
Implementation Details
The model uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. It is trained with a contrastive objective that maximizes the similarity of matched image-text pairs from fashion products (a sketch of this objective follows the list below).
- Utilizes white-background product images and detailed text descriptions
- Trained on concatenated product highlights and descriptions
- Achieves superior performance compared to base CLIP models on fashion tasks
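The contrastive objective referenced above is the standard CLIP-style symmetric loss over cosine similarities. The sketch below is a minimal illustration, not the actual Fashion-CLIP training code; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_embeds @ text_embeds.t() / temperature

    # Matched image-text pairs sit on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```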
Core Capabilities
- Zero-shot fashion product classification (see the sketch after this list)
- Cross-modal fashion concept understanding
- Product attribute detection and matching
- Improved performance on fashion-specific benchmarks (FMNIST: 0.83, KAGL: 0.73, DEEP: 0.62)
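As a rough illustration of the zero-shot classification capability listed above, the sketch below scores a product image against candidate label prompts using the Hugging Face transformers CLIP classes. The model id `patrickjohncyh/fashion-clip`, the image path, and the label strings are assumptions for illustration, not taken from this card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed Hub id for the Fashion-CLIP checkpoint.
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = Image.open("product.jpg")  # e.g. a white-background product shot
labels = ["a red dress", "a leather handbag", "a pair of sneakers"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```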
Frequently Asked Questions
Q: What makes this model unique?
Fashion-CLIP 2.0 demonstrates significant improvements over both the original CLIP and the first Fashion-CLIP release, particularly on fashion-specific zero-shot transfer tasks. It benefits from the LAION checkpoint's broader pre-training data while retaining specialized fashion understanding.
Q: What are the recommended use cases?
The model is ideal for e-commerce applications, product categorization, fashion recommendation systems, and zero-shot classification of fashion items. It performs best with standard product images on white backgrounds and detailed textual descriptions.
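For the e-commerce and recommendation scenarios mentioned above, a common pattern is to embed catalogue images once and rank them against free-text queries by cosine similarity. The sketch below illustrates this under assumptions: the model id, file names, and query string are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "patrickjohncyh/fashion-clip"  # assumed Hub id for the checkpoint
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Embed a small catalogue of product images once, offline (hypothetical files).
catalogue = [Image.open(p) for p in ["dress.jpg", "bag.jpg", "sneakers.jpg"]]
img_inputs = processor(images=catalogue, return_tensors="pt")
with torch.no_grad():
    img_embeds = F.normalize(model.get_image_features(**img_inputs), dim=-1)

# Embed a free-text query and rank products by cosine similarity.
txt_inputs = processor(text=["floral summer dress"], return_tensors="pt", padding=True)
with torch.no_grad():
    txt_embed = F.normalize(model.get_text_features(**txt_inputs), dim=-1)

scores = (txt_embed @ img_embeds.t()).squeeze(0)
print(scores.argsort(descending=True))  # indices of best-matching products
```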