CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
| Property | Value |
|---|---|
| License | MIT |
| Training Dataset | LAION-5B |
| Architecture | CLIP ViT-H/14 + XLM-RoBERTa Large |
| Paper | VTAB Paper |
What is CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k?
This model combines a frozen CLIP ViT-H/14 vision encoder with an XLM-RoBERTa Large text encoder for multilingual vision-language modeling. It was trained on the LAION-5B dataset with a global batch size of 90,000 for roughly 13 billion samples seen (the "s13B-b90k" in the name), and achieves strong cross-lingual performance on vision-language tasks.
Implementation Details
The vision transformer (ViT-H/14) is kept frozen during training and is initialized from LAION's previously released ViT-H/14 CLIP checkpoint, while the XLM-RoBERTa Large text encoder starts from pretrained weights and is fine-tuned. This setup reaches 77.0% zero-shot top-1 accuracy on ImageNet-1k; a loading sketch follows the list below.
- Trained on LAION-5B dataset
- Frozen ViT-H/14 vision encoder
- XLM-RoBERTa large text encoder
- 90k batch size training
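The checkpoint can be loaded through OpenCLIP. The sketch below is a minimal example, not an official recipe; the registry name `xlm-roberta-large-ViT-H-14` and pretrained tag `frozen_laion5b_s13b_b90k` are assumptions based on this checkpoint's naming and should be checked against your installed open_clip version.

```python
# Minimal sketch: load the multilingual CLIP checkpoint with OpenCLIP.
# Model/pretrained names are assumptions based on this checkpoint's name;
# verify them with open_clip.list_pretrained() for your installed version.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14",            # ViT-H/14 vision tower + XLM-RoBERTa Large text tower
    pretrained="frozen_laion5b_s13b_b90k",   # weights from this LAION-5B run
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")
model.eval()
```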
Core Capabilities
- Zero-shot image classification in multiple languages (see the sketch after this list)
- Image and text retrieval
- Multilingual zero-shot classification (Italian: 56%, Japanese: 53%, Chinese: 55.7%)
- Fine-tuning for downstream tasks
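As an illustration of multilingual zero-shot classification, the sketch below scores one image against Italian class prompts. The file name `cat.jpg` and the label set are placeholders, and `model`, `preprocess`, and `tokenizer` come from the loading sketch above.

```python
# Sketch: zero-shot classification with Italian class prompts.
# "cat.jpg" and the label set are illustrative placeholders.
import torch
from PIL import Image

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
labels_it = ["una foto di un gatto", "una foto di un cane", "una foto di un uccello"]
text = tokenizer(labels_it)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities gives per-class probabilities
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).squeeze(0)

for label, p in zip(labels_it, probs.tolist()):
    print(f"{p:.3f}  {label}")
```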
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its multilingual capabilities while maintaining strong English performance. It outperforms several language-specific CLIP models in their own languages while coming close to English-only CLIP models on English benchmarks.
Q: What are the recommended use cases?
The model excels at zero-shot image classification and cross-lingual image-text retrieval, and can be fine-tuned for downstream tasks; a retrieval sketch follows below. It is particularly valuable for applications that need multilingual vision-language understanding.
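The sketch below illustrates cross-lingual text-to-image retrieval by ranking a small folder of images against a Japanese query. The folder path and query text are placeholders, and `model`, `preprocess`, and `tokenizer` again come from the loading sketch above.

```python
# Sketch: cross-lingual text-to-image retrieval over a small image folder.
# The folder path and query are illustrative placeholders.
from pathlib import Path

import torch
from PIL import Image

image_paths = sorted(Path("images/").glob("*.jpg"))
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

query = "海辺で遊ぶ犬"  # Japanese: "a dog playing on the beach"
text = tokenizer([query])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every image
    scores = (text_features @ image_features.T).squeeze(0)

# Print the top-ranked images for the query
for idx in scores.argsort(descending=True)[:5]:
    print(f"{scores[idx]:.3f}  {image_paths[idx].name}")
```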