Jina CLIP v2

Maintained by: jinaai

  • Parameter Count: 865M
  • License: CC BY-NC 4.0
  • Paper: arXiv:2405.20204
  • Languages Supported: 89 languages
  • Tensor Type: FP16

What is jina-clip-v2?

Jina CLIP v2 is a state-of-the-art multilingual multimodal embedding model designed for text and image processing. Built upon its predecessor and incorporating advanced features from jina-embeddings-v3, this model represents a significant leap forward in multimodal AI capabilities. With 865M parameters, it combines two powerful encoders: a Jina-XLM-RoBERTa text encoder (561M parameters) and an EVA02-L14 vision encoder (304M parameters).

Implementation Details

The model architecture leverages cutting-edge technologies including FlashAttention2 for text processing and xFormers for vision processing. It supports high-resolution image inputs up to 512x512 pixels and can process text sequences up to 8,192 tokens.

  • Improved performance: a 3% gain over v1 in text-image and text-text retrieval
  • Support for 89 languages with enhanced multilingual image retrieval capabilities
  • Matryoshka representation allowing embedding dimensions to be truncated from 1024 down to 64 (see the sketch after this list)
  • Advanced attention mechanisms with FlashAttention2 and xFormers integration

Core Capabilities

  • Cross-modal search and retrieval between text and images (a retrieval sketch follows this list)
  • Multilingual text understanding and processing
  • High-resolution image processing (512x512)
  • Flexible embedding dimensions through matryoshka representation
  • Efficient inference with bfloat16 precision support

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its combination of multilingual capabilities, high-resolution image processing, and flexible embedding dimensions. The matryoshka representation feature allows users to optimize between performance and computational efficiency by adjusting embedding dimensions.

Q: What are the recommended use cases?

The model excels in multilingual image search, cross-modal retrieval systems, content recommendation, and general-purpose multimodal applications. It's particularly suitable for applications requiring language-agnostic understanding of text and images.
