CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k

Maintained By
laion

Property            Value
License             MIT
Training Dataset    LAION-5B
Architecture        CLIP ViT-H/14 + XLM-RoBERTa Large
Paper               VTAB Paper

What is CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k?

This is a multilingual vision-language model that combines a frozen CLIP ViT-H/14 vision encoder with an XLM-RoBERTa large text encoder. It was trained on the LAION-5B dataset with a batch size of 90,000 for 13B samples seen, and it is designed to handle vision-language tasks across many languages.
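The `s13B-b90k` suffix in the model name encodes the training budget described above. As a rough illustration, the sketch below parses that suffix and estimates the number of optimizer steps. It assumes that each step consumes exactly one full batch and that `13B` and `90k` are rounded figures; the helper names are hypothetical and not part of any library.

```python
import re

# Multipliers for the size suffixes used in the model name (assumed convention).
_SUFFIX = {"k": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def _to_int(token: str) -> int:
    """Convert a token such as '13B' or '90k' to an integer."""
    return int(float(token[:-1]) * _SUFFIX[token[-1]])


def parse_training_tag(model_name: str) -> dict:
    """Extract samples seen (sNN) and batch size (bNN) from a model name."""
    match = re.search(r"-s(\d+(?:\.\d+)?[kMB])-b(\d+(?:\.\d+)?[kMB])$", model_name)
    if match is None:
        raise ValueError(f"no training tag found in {model_name!r}")
    samples = _to_int(match.group(1))
    batch = _to_int(match.group(2))
    return {
        "samples_seen": samples,
        "batch_size": batch,
        # Integer division: an estimate, not a logged step count.
        "approx_steps": samples // batch,
    }


info = parse_training_tag(
    "CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k"
)
print(info)
```

Under these assumptions the estimate comes to roughly 144 thousand steps; the actual step count of the training run may differ.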

Implementation Details

The vision transformer (ViT-H/14) stays frozen during training and is initialized from LAION's earlier ViT-H/14 work. The text encoder, XLM-RoBERTa large, is initialized from pretrained weights and updated during training. With this setup, the model reaches 77.0% zero-shot accuracy on ImageNet-1K.

  • Trained on LAION-5B dataset
  • Frozen ViT-H/14 vision encoder
  • XLM-RoBERTa large text encoder
  • 90k batch size training

Core Capabilities

  • Zero-shot image classification in multiple languages
  • Image and text retrieval
  • Reported multilingual zero-shot accuracy (Italian: 56%, Japanese: 53%, Chinese: 55.7%)
  • Fine-tuning for downstream tasks
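Zero-shot classification with a CLIP-style model reduces to comparing one image embedding against one text embedding per candidate label. The sketch below shows only that scoring step in plain Python, with small made-up vectors standing in for real encoder outputs. The function names, the toy embeddings, and the logit scale of 100 are assumptions for illustration, not values taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N labels."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for encoder outputs (hypothetical values).
labels = ["un gatto", "un cane", "una macchina"]  # Italian label prompts
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

With a real model, `image_emb` and `text_embs` would come from the vision and text encoders. Because the text encoder is multilingual, the label prompts can be written in the target language rather than translated to English first.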

Frequently Asked Questions

Q: What makes this model unique?

This model pairs multilingual capability with strong English performance. Its reported results compare favorably with language-specific CLIP models in some languages, while its English accuracy stays close to that of English-only models.

Q: What are the recommended use cases?

The model is suited to zero-shot image classification and cross-lingual image-text retrieval, and it can be fine-tuned for downstream tasks. It is most useful in applications that need vision-language understanding across several languages.
