XLM-Roberta-Large-Vit-B-32

M-CLIP

Multilingual CLIP model that extends OpenAI's CLIP to 48 languages by pairing an XLM-RoBERTa-Large text encoder with a ViT-B/32 visual backbone. Popular, with 12M+ downloads.

Property	Value
Author	M-CLIP
Downloads	12.3M+
Languages Supported	48
Framework	PyTorch, TensorFlow

What is XLM-Roberta-Large-Vit-B-32?

XLM-Roberta-Large-Vit-B-32 is a multilingual extension of OpenAI's CLIP model that connects vision and language across 48 languages. It pairs an XLM-RoBERTa-Large text encoder with a ViT-B/32 vision transformer, so text in any supported language can be matched against images in a shared embedding space.

Implementation Details

The model architecture consists of two components: a multilingual text encoder based on XLM-RoBERTa-Large and a ViT-B/32 vision encoder. The model reaches an R@10 score of 91.8% for English and maintains strong performance (80-90%) in other languages such as German, Spanish, French, and Chinese. R@10 (recall at 10) is the fraction of queries for which the correct match appears among the top 10 retrieved results.
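
To make the R@10 figures quoted above concrete, here is a minimal sketch of how a recall-at-k score can be computed from a text-to-image similarity matrix. The function name recall_at_k and the toy matrix are illustrative assumptions, not part of the model or its evaluation code, and the numbers are made up rather than produced by the model.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct item is ranked in the top k.

    similarity[i][j] is the score between text query i and image j.
    Assumption for this sketch: the correct image for query i is image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Image indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 4x4 similarity matrix (made-up numbers, not model output).
toy_similarity = [
    [0.9, 0.1, 0.2, 0.3],  # query 0: correct image ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # query 1: correct image ranked 2nd
    [0.7, 0.6, 0.3, 0.4],  # query 2: correct image ranked 4th
    [0.1, 0.2, 0.3, 0.9],  # query 3: correct image ranked 1st
]

print(recall_at_k(toy_similarity, 1))  # 2 of 4 queries hit -> 0.5
print(recall_at_k(toy_similarity, 2))  # 3 of 4 queries hit -> 0.75
```

In a real evaluation the matrix would hold cosine similarities between caption embeddings in one language and image embeddings from the ViT-B/32 encoder, with k set to 10.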

Multilingual text encoding supporting 48 languages including English, German, Chinese, Russian, and more
Compatible with both PyTorch and TensorFlow frameworks
Demonstrated strong cross-lingual retrieval capabilities
Easy integration with the multilingual-clip package
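
The integration with the multilingual-clip package mentioned above might look roughly like the following sketch. It assumes the multilingual-clip and transformers packages are installed and that the weights are published under the identifier M-CLIP/XLM-Roberta-Large-Vit-B-32. The helper embed_texts is a hypothetical wrapper introduced here for illustration; check the package's own documentation before relying on the exact calls.

```python
# Assumed model identifier on the model hub.
MODEL_NAME = "M-CLIP/XLM-Roberta-Large-Vit-B-32"


def embed_texts(texts, model_name=MODEL_NAME):
    """Return one embedding per input string (hypothetical helper).

    The imports live inside the function so this file loads even when the
    packages are missing. Calling the function downloads the model weights.
    """
    import transformers
    from multilingual_clip import pt_multilingual_clip

    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    # The package's model takes raw strings plus the tokenizer.
    return model.forward(texts, tokenizer)


# Example call (not executed here because it needs network access):
# embeddings = embed_texts(["a dog on the beach", "ein Hund am Strand"])
# print(embeddings.shape)
```

Image embeddings are produced separately by a CLIP ViT-B/32 image encoder. The two sets of vectors can then be compared with cosine similarity.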

Core Capabilities

Cross-lingual image-text retrieval
Multilingual zero-shot classification
Text-to-image search across 48 languages
Performance competitive with English-only CLIP models
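
Multilingual zero-shot classification, as listed above, comes down to comparing one image embedding with the embeddings of candidate labels written in any supported language. The sketch below uses tiny hand-written vectors in place of real model output. The function names, the labels, and the temperature of 100 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_vec, label_vecs, temperature=100.0):
    """Softmax over scaled cosine similarities between an image and each label."""
    logits = [temperature * cosine(image_vec, v) for v in label_vecs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_vecs, exps)}


# Stand-in embeddings (made up): labels in German, Spanish, and French.
label_vecs = {
    "ein Hund": [1.0, 0.0, 0.1],
    "un gato": [0.0, 1.0, 0.1],
    "une voiture": [0.1, 0.1, 1.0],
}
image_vec = [0.9, 0.1, 0.2]  # pretend this came from the image encoder

probabilities = zero_shot_classify(image_vec, label_vecs)
best = max(probabilities, key=probabilities.get)
print(best)
```

Because every label goes through the same text encoder, the candidate labels do not all need to be in the same language.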

Frequently Asked Questions

Q: What makes this model unique?

This model extends CLIP to 48 languages while keeping performance comparable to the original English model. It is particularly notable for R@10 scores above 88% in most of the languages for which results are reported.

Q: What are the recommended use cases?

The model is ideal for multilingual image-text retrieval tasks, cross-lingual visual search systems, and zero-shot classification applications where content needs to be processed in multiple languages.
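
A cross-lingual visual search system of the kind described above can be sketched as a small in-memory index: image embeddings are stored once, and a text query in any language is embedded and ranked against them. The class ImageIndex, the file names, and the vectors below are hypothetical, and the embeddings are made up rather than produced by the model.

```python
import heapq
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class ImageIndex:
    """Stores unit-length image embeddings keyed by image id."""

    def __init__(self):
        self._items = {}

    def add(self, image_id, embedding):
        self._items[image_id] = normalize(embedding)

    def search(self, query_embedding, k=3):
        """Return the ids of the k images most similar to the query."""
        q = normalize(query_embedding)
        # On unit vectors the dot product equals cosine similarity.
        scored = (
            (sum(a * b for a, b in zip(q, v)), image_id)
            for image_id, v in self._items.items()
        )
        return [image_id for _, image_id in heapq.nlargest(k, scored)]


index = ImageIndex()
index.add("beach.jpg", [1.0, 0.1, 0.0])
index.add("forest.jpg", [0.0, 1.0, 0.2])
index.add("city.jpg", [0.1, 0.0, 1.0])

# Stand-in for the text embedding of a query such as "ein Wald".
query = [0.2, 0.9, 0.1]
print(index.search(query, k=2))  # ['forest.jpg', 'beach.jpg']
```

For large collections, the linear scan would typically be replaced by an approximate nearest-neighbour index, but the ranking logic stays the same.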