CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k

Maintained by: laion

Model Type: CLIP Vision-Language Model
Architecture: ViT-B/32 + XLM-RoBERTa Base
Training Data: LAION-5B
Batch Size: 90,000
Training Samples: 13B
Author: LAION / Romain Beaumont

What is CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k?

This is a multilingual CLIP model that pairs a ViT-B/32 vision transformer with an XLM-RoBERTa base text encoder, trained on the large-scale LAION-5B dataset. The model was developed by LAION and trained on stability.ai's computing infrastructure. By coupling a pretrained multilingual text encoder with a CLIP vision tower, it delivers strong zero-shot results across many languages rather than English alone.

Implementation Details

The model employs OpenCLIP's implementation framework and was trained with a substantial batch size of 90,000 over 13B samples from LAION-5B. The vision encoder uses a ViT-B/32 architecture, while the text encoder leverages pretrained XLM-RoBERTa base weights for enhanced multilingual capabilities.
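For reference, the checkpoint can be loaded through the OpenCLIP library. The following is a minimal sketch, assuming the OpenCLIP model name `xlm-roberta-base-ViT-B-32` and pretrained tag `laion5b_s13b_b90k` (derived from the checkpoint's name on the Hugging Face Hub):

```python
import open_clip

# Load the model, image preprocessing transform, and tokenizer via OpenCLIP.
# The model name and pretrained tag are assumed to follow OpenCLIP's
# naming convention for this checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-base-ViT-B-32', pretrained='laion5b_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')
model.eval()  # inference mode
```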

  • Achieves 62.33% zero-shot accuracy on ImageNet-1K
  • Reaches 63.4% on MSCOCO image-text retrieval
  • Scores 86.2% on Flickr30k retrieval
  • Shows strong multilingual transfer: 43% zero-shot accuracy on Italian ImageNet and 37% on Japanese ImageNet

Core Capabilities

  • Zero-shot image classification across multiple languages (see the sketch after this list)
  • Image and text retrieval tasks
  • Cross-lingual image understanding
  • Fine-tuning for downstream vision tasks
  • Image generation guidance and conditioning
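
As a concrete illustration of the first capability, the sketch below scores a single image against candidate captions in several languages. It continues from the loading snippet above (reusing `model`, `preprocess`, and `tokenizer`); the image path and label strings are placeholders:

```python
import torch
from PIL import Image

# Placeholder image path and multilingual candidate labels.
image = preprocess(Image.open('cat.jpg')).unsqueeze(0)
labels = ['a photo of a cat', 'una foto di un gatto', '猫の写真',
          'a photo of a dog']
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so dot products become cosine similarities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale by 100 (a common CLIP convention) and softmax over the labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f'{label}: {p:.3f}')
```

Because image and text embeddings share a single space, captions in different languages compete directly in the same softmax.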

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its multilingual capabilities, pairing ViT-B/32 vision processing with XLM-RoBERTa's multilingual text understanding. It substantially outperforms English-only CLIP models on non-English tasks while remaining competitive on English benchmarks.

Q: What are the recommended use cases?

The model excels in zero-shot image classification, cross-lingual image-text retrieval, and can be fine-tuned for various downstream tasks. It's particularly valuable for applications requiring multilingual image understanding or cross-lingual visual search capabilities.
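
To make the cross-lingual retrieval use case concrete, here is a hypothetical sketch that ranks a small image gallery against a German-language query, again reusing `model`, `preprocess`, and `tokenizer` from the loading example (the file names are placeholders):

```python
import torch
from PIL import Image

# Placeholder gallery of local images and a German text query.
paths = ['beach.jpg', 'mountain.jpg', 'city.jpg']
images = torch.stack([preprocess(Image.open(p)) for p in paths])
query = tokenizer(['ein Foto von einem Strand'])  # "a photo of a beach"

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    # Normalize both sides so the matrix product yields cosine similarities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)

# Print the gallery ranked by similarity to the query, best match first.
for p, s in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f'{p}: {s:.3f}')
```

Since the embedding space is shared, the same precomputed image index can be queried with text in any language the XLM-RoBERTa encoder covers.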
