XLM-Roberta-Large-Vit-B-32

M-CLIP

A multilingual CLIP model that extends OpenAI's CLIP to 48 languages by pairing an XLM-RoBERTa text encoder with a ViT-B/32 visual backbone. It has been downloaded more than 12 million times.

Property             Value
Author               M-CLIP
Downloads            12.3M+
Languages Supported  48
Framework            PyTorch, TensorFlow

What is XLM-Roberta-Large-Vit-B-32?

XLM-Roberta-Large-Vit-B-32 is a multilingual extension of OpenAI's CLIP model, designed to connect vision and language across 48 languages. It combines an XLM-RoBERTa text encoder with a ViT-B/32 vision transformer, so that text in any supported language can be compared directly with images.

Implementation Details

The model architecture consists of two main components: a multilingual text encoder based on XLM-RoBERTa-Large and a vision encoder using ViT-B/32. On retrieval, the model reaches an R@10 (recall at 10) of 91.8% for English and stays in the 80-90% range for other languages such as German, Spanish, French, and Chinese.
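To make the R@10 figures concrete, here is a minimal sketch of how recall at k can be computed from a query-by-item similarity matrix. It assumes the correct image for query i sits at index i; the function name `recall_at_k` and the toy scores are illustrative and are not taken from the model's evaluation code.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose matching item appears in the top-k results.

    similarity[i][j] is the score between text query i and image j.
    Assumption: the correct image for query i is stored at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers, not model output).
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct image is ranked 1st
    [0.2, 0.4, 0.8],  # query 1: correct image is ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct image is ranked 3rd
]

for k in (1, 2, 3):
    print(k, recall_at_k(scores, k))
```

With real data, `similarity` would hold cosine similarities between text embeddings and image embeddings, and k would be set to 10.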

  • Multilingual text encoding supporting 48 languages including English, German, Chinese, Russian, and more
  • Compatible with both PyTorch and TensorFlow frameworks
  • Demonstrated strong cross-lingual retrieval capabilities
  • Easy integration with the multilingual-clip package
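As a rough illustration of the package integration mentioned above, the following sketch wraps text encoding in a helper. It assumes the `multilingual-clip`, `transformers`, and `torch` packages are installed and that the checkpoint is published under the Hugging Face id `M-CLIP/XLM-Roberta-Large-Vit-B-32`; the helper name `embed_texts` is hypothetical.

```python
MODEL_NAME = "M-CLIP/XLM-Roberta-Large-Vit-B-32"  # assumed Hugging Face model id


def embed_texts(texts, model_name=MODEL_NAME):
    """Return one embedding per input sentence as a torch tensor."""
    # Imported lazily so this file can be loaded without the heavy
    # dependencies; calling the function requires multilingual-clip,
    # transformers, and torch, and downloads the weights on first use.
    from multilingual_clip import pt_multilingual_clip
    import transformers

    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    # The package's forward() takes raw strings plus the tokenizer.
    return model.forward(texts, tokenizer)
```

A call such as `embed_texts(["Ein Hund spielt im Park", "A dog plays in the park"])` should return one vector per sentence. Matching image embeddings would come from a separate CLIP ViT-B/32 image encoder, which this sketch does not load.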

Core Capabilities

  • Cross-lingual image-text retrieval
  • Multilingual zero-shot classification
  • Text-to-image search across 48 languages
  • Competitive performance with English-only CLIP models
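The retrieval and zero-shot classification capabilities listed above both reduce to comparing embeddings in a shared space. A minimal sketch, assuming the image and label embeddings have already been computed: the vectors below are made-up three-dimensional stand-ins, and the temperature of 100 is an assumption modeled on typical CLIP logit scaling rather than a value stated for this model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_vec, label_vecs, temperature=100.0):
    """Softmax over scaled cosine similarities between an image and labels."""
    logits = [temperature * cosine(image_vec, v) for v in label_vecs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_vecs, exps)}


# Hypothetical embeddings; labels may be written in different languages.
image_vec = [1.0, 0.0, 0.2]
label_vecs = {
    "ein Hund": [0.9, 0.1, 0.1],
    "un gato": [0.1, 0.9, 0.0],
    "a car": [0.0, 0.2, 0.9],
}

probs = zero_shot_classify(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Text-to-image search follows the same pattern with the roles swapped: one text embedding is scored against many image embeddings and the highest-scoring images are returned.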

Frequently Asked Questions

Q: What makes this model unique?

This model extends CLIP's capabilities to 48 languages while keeping retrieval performance close to that of the original English model. Its reported R@10 is 91.8% for English and above 88% for most of the other languages with reported results.

Q: What are the recommended use cases?

The model is ideal for multilingual image-text retrieval tasks, cross-lingual visual search systems, and zero-shot classification applications where content needs to be processed in multiple languages.
