jina-embeddings-v2-base-de
| Property | Value |
|---|---|
| Parameter Count | 161M |
| License | Apache 2.0 |
| Paper | Multi-Task Contrastive Learning Paper |
| Max Sequence Length | 8192 tokens |
What is jina-embeddings-v2-base-de?
jina-embeddings-v2-base-de is a powerful bilingual text embedding model designed specifically for German and English language processing. Built on a modified BERT architecture (JinaBERT), it uses symmetric bidirectional ALiBi (Attention with Linear Biases) to handle sequences of up to 8192 tokens. The model excels in both monolingual and cross-lingual applications, particularly in scenarios involving mixed German-English content.
Implementation Details
The model uses mean pooling over token embeddings to generate sentence-level vectors and can be integrated easily through popular frameworks such as transformers or sentence-transformers. It has been extensively evaluated on the MTEB benchmark, showing strong performance across a range of German and English tasks.
- Architecture: Modified BERT with symmetric ALiBi
- Parameter Count: 161 million
- Maximum Sequence Length: 8192 tokens
- Supported Languages: German and English
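The mean-pooling step mentioned above can be sketched as follows. This is a minimal NumPy illustration of what mean pooling does, assuming token embeddings and an attention mask of the kind a transformers tokenizer produces; the array shapes and values are toy stand-ins, not actual model output:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid division by zero
    return summed / counts

# Toy example: batch of 1, seq_len 3 (last position is padding), dim 2
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
msk = np.array([[1, 1, 0]])
print(mean_pool(emb, msk))  # [[2. 3.]] — the padding token is excluded
```

Masking before averaging matters: without it, padding tokens would pull the sentence vector toward arbitrary values.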
Core Capabilities
- Bilingual text embedding generation
- Long sequence processing (up to 8192 tokens)
- High performance in cross-lingual applications
- Efficient mean pooling implementation
- Strong performance on the MTEB benchmark
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to handle extremely long sequences (8192 tokens) and its specialized optimization for German-English bilingual content set it apart. It uses a symmetric bidirectional ALiBi architecture, making it particularly effective for cross-lingual applications.
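The idea behind symmetric bidirectional ALiBi can be illustrated with a small sketch: instead of positional embeddings, each attention head adds a linear penalty proportional to the distance |i − j| between token positions, which is what lets the model extrapolate to long sequences. The sketch below is a simplified illustration of that mechanism, not the model's actual implementation; the slopes follow the standard ALiBi geometric schedule:

```python
import numpy as np

def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head symmetric ALiBi bias: -slope_h * |i - j|.

    Returns an array of shape (num_heads, seq_len, seq_len) that would be
    added to the attention scores before the softmax.
    """
    # Standard ALiBi geometric slopes: 2^(-8k / num_heads) for k = 1..num_heads
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])  # symmetric distance matrix
    return -slopes[:, None, None] * dist[None, :, :]

bias = symmetric_alibi_bias(seq_len=4, num_heads=2)
print(bias.shape)      # (2, 4, 4)
print(bias[0, 0])      # penalty grows linearly with distance from position 0
```

Because the bias depends only on |i − j|, attention is penalized equally in both directions — the "symmetric bidirectional" variant suited to encoder models, unlike the causal ALiBi used in decoders.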
Q: What are the recommended use cases?
The model excels in applications such as cross-lingual information retrieval, semantic search, document similarity analysis, and RAG (Retrieval-Augmented Generation) systems. It is particularly effective when working with mixed German-English content or when long-sequence processing is required.
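A retrieval step of the kind described above can be sketched as cosine-similarity search over embedding vectors. The vectors below are toy stand-ins for model output; in practice the query and documents would be encoded with the model (e.g. via sentence-transformers) before this step:

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity per document
    return list(np.argsort(-sims)[:k])    # indices sorted by descending similarity

# Toy stand-ins for the embedding of a German query and mixed-language documents
query = np.array([0.9, 0.1, 0.0])
docs = np.array([
    [0.8, 0.2, 0.1],   # close to the query
    [0.0, 1.0, 0.0],   # unrelated
    [0.7, 0.0, 0.3],   # somewhat close
])
print(top_k(query, docs))  # [0, 2]
```

Because the model maps German and English text into a shared embedding space, the same similarity computation works regardless of which language the query or documents are in.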