Jina CLIP v2
| Property | Value |
|---|---|
| Parameter Count | 865M |
| License | CC BY-NC 4.0 |
| Paper | arXiv:2405.20204 |
| Languages Supported | 89 |
| Tensor Type | FP16 |
What is jina-clip-v2?
Jina CLIP v2 is a state-of-the-art multilingual multimodal embedding model for text and image retrieval. Built on jina-clip-v1 and the Jina-XLM-RoBERTa text backbone of jina-embeddings-v3, it adds multilingual text support, higher-resolution image input, and flexible Matryoshka output dimensions. Its 865M parameters are split across two encoders: a Jina-XLM-RoBERTa text encoder (561M parameters) and an EVA02-L14 vision encoder (304M parameters).
Implementation Details
The model architecture uses FlashAttention2 in the text tower and xFormers in the vision tower for efficient attention. It accepts image inputs up to 512x512 pixels and text sequences up to 8,192 tokens.
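As a rough orientation, the snippet below sketches how the two encoders are typically called through Hugging Face Transformers. The `encode_text`/`encode_image` helpers and the placeholder image URL are assumptions based on the model's custom remote code; consult the official model card for the exact interface.

```python
# Minimal usage sketch; assumes the custom encode_text / encode_image helpers that
# jinaai/jina-clip-v2 exposes via trust_remote_code (verify against the model card).
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Text inputs may be up to 8,192 tokens long and can mix languages.
text_embeddings = model.encode_text(
    ["A photo of a red bicycle", "Ein Foto eines roten Fahrrads"]
)

# Image inputs are given as URLs or local paths (placeholder URL below) and are
# processed at up to 512x512 resolution.
image_embeddings = model.encode_image(["https://example.com/bicycle.jpg"])

print(text_embeddings.shape, image_embeddings.shape)  # e.g. (2, 1024) and (1, 1024)
```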
- Improved performance with 3% gain over v1 in text-image and text-text retrieval
- Support for 89 languages with improved multilingual image retrieval
- Matryoshka representation allowing embedding dimensions to be reduced from 1024 down to 64 (see the truncation sketch after this list)
- Advanced attention mechanisms with FlashAttention2 and xFormers integration
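Matryoshka-style training concentrates most of the signal in the leading dimensions of each embedding, so a vector can be truncated and re-normalized rather than recomputed. Below is a minimal NumPy sketch of that step; the random vector is only a stand-in for a real 1024-dimensional jina-clip-v2 embedding.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embedding[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.rand(1024).astype(np.float32)  # stand-in for a real 1024-d embedding
small = truncate_embedding(full, 64)            # 64-d vector: ~16x less storage per item
```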
Core Capabilities
- Cross-modal search and retrieval between text and images (see the similarity sketch after this list)
- Multilingual text understanding and processing
- High-resolution image processing (512x512)
- Flexible embedding dimensions through matryoshka representation
- Efficient inference with bfloat16 precision support
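Because texts and images are embedded into the same vector space, cross-modal retrieval reduces to a nearest-neighbor search under cosine similarity. The self-contained sketch below uses random stand-in vectors in place of real embeddings.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a corpus matrix."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query

# Stand-ins: one text-query embedding and five image embeddings, all 1024-d.
text_query = np.random.rand(1024).astype(np.float32)
image_corpus = np.random.rand(5, 1024).astype(np.float32)

scores = cosine_similarity(text_query, image_corpus)
best_image = int(np.argmax(scores))  # index of the best-matching image
```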
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its combination of multilingual capabilities, high-resolution image processing, and flexible embedding dimensions. The Matryoshka representation lets users trade retrieval quality against storage and compute cost by truncating the embedding dimension.
Q: What are the recommended use cases?
The model excels in multilingual image search, cross-modal retrieval systems, content recommendation, and general-purpose multimodal applications. It's particularly suitable for applications requiring language-agnostic understanding of text and images.