CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
| Property | Value |
|---|---|
| License | MIT |
| Training Dataset | LAION-5B |
| Architecture | CLIP ViT-H/14 + XLM-RoBERTa Large |
| Paper | VTAB Paper |
What is CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k?
This model combines a frozen CLIP ViT-H/14 vision encoder with an XLM-RoBERTa Large text encoder for multilingual vision-language modeling. It was trained on the LAION-5B dataset with a global batch size of 90,000 for roughly 13 billion samples seen (the "s13B-b90k" in the name), and achieves strong cross-lingual performance on vision-language tasks.
Implementation Details
The vision transformer (ViT-H/14) is kept frozen during training and is initialized from LAION's previously released ViT-H/14 CLIP checkpoint, while the XLM-RoBERTa Large text encoder starts from pretrained weights and is fine-tuned. This setup reaches 77.0% zero-shot top-1 accuracy on ImageNet-1k; a loading sketch follows the list below.
- Trained on LAION-5B dataset
- Frozen ViT-H/14 vision encoder
- XLM-RoBERTa large text encoder
- 90k batch size training
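The checkpoint can be loaded through OpenCLIP. The sketch below is a minimal example, not an official recipe; the registry name `xlm-roberta-large-ViT-H-14` and pretrained tag `frozen_laion5b_s13b_b90k` are assumptions based on this checkpoint's naming and should be checked against your installed open_clip version.

```python
# Minimal sketch: load the multilingual CLIP checkpoint with OpenCLIP.
# Model/pretrained names are assumptions based on this checkpoint's name;
# verify them with open_clip.list_pretrained() for your installed version.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14",            # ViT-H/14 vision tower + XLM-RoBERTa Large text tower
    pretrained="frozen_laion5b_s13b_b90k",   # weights from this LAION-5B run
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")
model.eval()
```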
Core Capabilities
- Zero-shot image classification in multiple languages (see the sketch after this list)
- Image and text retrieval
- Multilingual zero-shot classification (Italian: 56%, Japanese: 53%, Chinese: 55.7%)
- Fine-tuning for downstream tasks
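As an illustration of multilingual zero-shot classification, the sketch below scores one image against Italian class prompts. The file name `cat.jpg` and the label set are placeholders, and `model`, `preprocess`, and `tokenizer` come from the loading sketch above.

```python
# Sketch: zero-shot classification with Italian class prompts.
# "cat.jpg" and the label set are illustrative placeholders.
import torch
from PIL import Image

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
labels_it = ["una foto di un gatto", "una foto di un cane", "una foto di un uccello"]
text = tokenizer(labels_it)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities gives per-class probabilities
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).squeeze(0)

for label, p in zip(labels_it, probs.tolist()):
    print(f"{p:.3f}  {label}")
```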
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its multilingual capabilities while maintaining strong English performance. It outperforms several language-specific CLIP models in their own languages while coming close to English-only CLIP models on English benchmarks.
Q: What are the recommended use cases?
The model excels at zero-shot image classification and cross-lingual image-text retrieval, and can be fine-tuned for downstream tasks; a retrieval sketch follows below. It is particularly valuable for applications that need multilingual vision-language understanding.
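The sketch below illustrates cross-lingual text-to-image retrieval by ranking a small folder of images against a Japanese query. The folder path and query text are placeholders, and `model`, `preprocess`, and `tokenizer` again come from the loading sketch above.

```python
# Sketch: cross-lingual text-to-image retrieval over a small image folder.
# The folder path and query are illustrative placeholders.
from pathlib import Path

import torch
from PIL import Image

image_paths = sorted(Path("images/").glob("*.jpg"))
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

query = "海辺で遊ぶ犬"  # Japanese: "a dog playing on the beach"
text = tokenizer([query])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every image
    scores = (text_features @ image_features.T).squeeze(0)

# Print the top-ranked images for the query
for idx in scores.argsort(descending=True)[:5]:
    print(f"{scores[idx]:.3f}  {image_paths[idx].name}")
```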