Fashion-CLIP

Maintained by: patrickjohncyh

Property         Value
Parameter Count  151M
License          MIT
Paper            Scientific Reports
Architecture     ViT-B/32 + Transformer

What is fashion-clip?

Fashion-CLIP is an adaptation of the CLIP architecture fine-tuned for fashion-related tasks. Starting from the LAION CLIP checkpoint, it is trained on 800K fashion products from the Farfetch dataset to understand and represent fashion concepts in both visual and textual form.

Implementation Details

The model employs a ViT-B/32 Transformer for image encoding and a masked self-attention Transformer for text encoding. It is trained with a contrastive objective that maximizes the similarity of matched image-text pairs from fashion products while pushing mismatched pairs apart (a simplified sketch of this objective follows the list below).

  • Utilizes white-background product images and detailed text descriptions
  • Trained on concatenated product highlights and descriptions
  • Achieves superior performance compared to base CLIP models on fashion tasks
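
The following is a minimal sketch of the CLIP-style contrastive objective described above, written in PyTorch. It is illustrative only, not the authors' training code; `image_emb`, `text_emb`, and the temperature value are assumed placeholders.

```python
# Minimal sketch of a CLIP-style contrastive loss (illustrative, not the
# authors' training code). `image_emb` and `text_emb` are assumed to be
# L2-normalized embeddings of a batch of matched fashion image/text pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Pairwise cosine similarities scaled by temperature: shape (batch, batch)
    logits = image_emb @ text_emb.T / temperature
    # The i-th image matches the i-th text, so the targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image->text and text->image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```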

Core Capabilities

  • Zero-shot fashion product classification
  • Cross-modal fashion concept understanding
  • Product attribute detection and matching
  • Improved performance on fashion-specific benchmarks (weighted macro F1: FMNIST 0.83, KAGL 0.73, DEEP 0.62)
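
A hedged example of zero-shot product classification using the Hugging Face transformers CLIP classes is shown below. It assumes the checkpoint is available under the Hugging Face id "patrickjohncyh/fashion-clip"; the image path and candidate labels are placeholders.

```python
# Zero-shot classification of a product image with the standard CLIP classes
# from Hugging Face transformers. The image path and labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = Image.open("product.jpg")  # ideally a white-background product shot
candidate_labels = ["a red dress", "a pair of sneakers", "a leather handbag"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image: similarity of the image to each candidate description
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(candidate_labels, probs[0].tolist())))
```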

Frequently Asked Questions

Q: What makes this model unique?

Fashion-CLIP 2.0 demonstrates significant improvements over both the original CLIP and previous Fashion-CLIP versions, particularly in fashion-specific zero-shot transfer tasks. It leverages the LAION checkpoint's broader training data while maintaining specialized fashion understanding.

Q: What are the recommended use cases?

The model is ideal for e-commerce applications, product categorization, fashion recommendation systems, and zero-shot classification of fashion items. It performs best with standard product images on white backgrounds and detailed textual descriptions.
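
As a sketch of how the model might back a retrieval or recommendation flow, the example below encodes a small catalog of product images once and ranks them against a free-text query. The file names, the query string, and the "patrickjohncyh/fashion-clip" id are assumptions for illustration.

```python
# Sketch of text-to-image product retrieval for an e-commerce catalog.
# File names and the query string are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

catalog = ["dress_001.jpg", "sneaker_042.jpg", "handbag_007.jpg"]
images = [Image.open(path) for path in catalog]

with torch.no_grad():
    # Encode and L2-normalize the catalog images once, then reuse the index
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Encode the free-text query the same way
    text_inputs = processor(text=["floral summer dress"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity ranks catalog items against the query
scores = (text_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {catalog[best]} (score={scores[best].item():.3f})")
```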
