Jina CLIP v2
| Property | Value |
|---|---|
| Parameter Count | 865M |
| License | CC BY-NC 4.0 |
| Paper | arXiv:2405.20204 |
| Languages Supported | 89 |
| Tensor Type | FP16 |
What is jina-clip-v2?
Jina CLIP v2 is a state-of-the-art multilingual multimodal embedding model for text and image retrieval. Built on jina-clip-v1 and the Jina-XLM-RoBERTa text backbone of jina-embeddings-v3, it adds multilingual text support, higher-resolution image input, and flexible Matryoshka output dimensions. Its 865M parameters are split across two encoders: a Jina-XLM-RoBERTa text encoder (561M parameters) and an EVA02-L14 vision encoder (304M parameters).
Implementation Details
The model architecture uses FlashAttention2 in the text tower and xFormers in the vision tower for efficient attention. It accepts image inputs up to 512x512 pixels and text sequences up to 8,192 tokens.
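As a rough orientation, the snippet below sketches how the two encoders are typically called through Hugging Face Transformers. The `encode_text`/`encode_image` helpers and the placeholder image URL are assumptions based on the model's custom remote code; consult the official model card for the exact interface.

```python
# Minimal usage sketch; assumes the custom encode_text / encode_image helpers that
# jinaai/jina-clip-v2 exposes via trust_remote_code (verify against the model card).
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Text inputs may be up to 8,192 tokens long and can mix languages.
text_embeddings = model.encode_text(
    ["A photo of a red bicycle", "Ein Foto eines roten Fahrrads"]
)

# Image inputs are given as URLs or local paths (placeholder URL below) and are
# processed at up to 512x512 resolution.
image_embeddings = model.encode_image(["https://example.com/bicycle.jpg"])

print(text_embeddings.shape, image_embeddings.shape)  # e.g. (2, 1024) and (1, 1024)
```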
- Improved performance with 3% gain over v1 in text-image and text-text retrieval
- Support for 89 languages with improved multilingual image retrieval
- Matryoshka representation allowing embedding dimensions to be reduced from 1024 down to 64 (see the truncation sketch after this list)
- Advanced attention mechanisms with FlashAttention2 and xFormers integration
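Matryoshka-style training concentrates most of the signal in the leading dimensions of each embedding, so a vector can be truncated and re-normalized rather than recomputed. Below is a minimal NumPy sketch of that step; the random vector is only a stand-in for a real 1024-dimensional jina-clip-v2 embedding.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embedding[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.rand(1024).astype(np.float32)  # stand-in for a real 1024-d embedding
small = truncate_embedding(full, 64)            # 64-d vector: ~16x less storage per item
```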
Core Capabilities
- Cross-modal search and retrieval between text and images (see the similarity sketch after this list)
- Multilingual text understanding and processing
- High-resolution image processing (512x512)
- Flexible embedding dimensions through matryoshka representation
- Efficient inference with bfloat16 precision support
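Because texts and images are embedded into the same vector space, cross-modal retrieval reduces to a nearest-neighbor search under cosine similarity. The self-contained sketch below uses random stand-in vectors in place of real embeddings.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a corpus matrix."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query

# Stand-ins: one text-query embedding and five image embeddings, all 1024-d.
text_query = np.random.rand(1024).astype(np.float32)
image_corpus = np.random.rand(5, 1024).astype(np.float32)

scores = cosine_similarity(text_query, image_corpus)
best_image = int(np.argmax(scores))  # index of the best-matching image
```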
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its combination of multilingual capabilities, high-resolution image processing, and flexible embedding dimensions. The Matryoshka representation lets users trade retrieval quality against storage and compute cost by truncating the embedding dimension.
Q: What are the recommended use cases?
The model excels in multilingual image search, cross-modal retrieval systems, content recommendation, and general-purpose multimodal applications. It's particularly suitable for applications requiring language-agnostic understanding of text and images.