# Jina Embeddings V2 Base (English)
| Property | Value |
|---|---|
| Parameter Count | 137M |
| Max Sequence Length | 8192 tokens |
| License | Apache 2.0 |
| Paper | Technical Report |
| Architecture | BERT with ALiBi positioning |
## What is jina-embeddings-v2-base-en?
Jina Embeddings V2 Base is an English text embedding model that uses ALiBi (Attention with Linear Biases) positioning to handle sequences of up to 8192 tokens. Trained on over 400 million curated sentence pairs, it offers a strong balance between computational efficiency and embedding quality.
## Implementation Details
The model uses a modified BERT architecture with symmetric bidirectional ALiBi positioning, pretrained on the C4 dataset and then fine-tuned on a large collection of curated sentence pairs. Although trained at a 512-token sequence length, it extrapolates to 8K tokens at inference because ALiBi's attention bias depends only on relative token distance, not on learned absolute position embeddings.
- 137M parameters optimized for production deployment
- Supports sequence lengths up to 8192 tokens
- Trained on 400M+ carefully selected sentence pairs
- Implements symmetric bidirectional ALiBi positioning
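The symmetric bidirectional ALiBi scheme can be illustrated with a short sketch. This shows the general idea rather than the model's exact implementation; `alibi_slopes` and `symmetric_alibi_bias` are hypothetical names for illustration.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric slope sequence from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    (assumes n_heads is a power of two, as in the original formulation)."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def symmetric_alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Additive attention bias -m * |i - j| per head, shape (H, L, L).

    Using the symmetric distance |i - j| instead of the causal offset of
    decoder-only ALiBi makes the bias bidirectional.  Because it depends
    only on token distance, not absolute position, the same formula applies
    at any length -- the basis of the 512 -> 8K extrapolation."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])            # (L, L)
    return -alibi_slopes(n_heads)[:, None, None] * distance   # (H, L, L)

bias = symmetric_alibi_bias(seq_len=6, n_heads=8)
print(bias.shape)   # → (8, 6, 6)
```

Each head penalizes distant token pairs at a different rate, so some heads attend locally and others globally, with no retraining needed for longer inputs.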
## Core Capabilities
- Long document retrieval and processing
- Semantic textual similarity analysis
- Advanced text reranking
- RAG (Retrieval-Augmented Generation) applications
- LLM-based generative search
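Semantic similarity with an embedding model reduces to pooling token vectors into one sentence vector and comparing vectors by cosine similarity. A minimal, library-free sketch of the mask-aware mean pooling commonly used with BERT-style encoders, with toy arrays standing in for real model outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, counting only real (non-padding) tokens."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, L, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (B, 1)
    return summed / counts

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the encoder's last hidden states (batch=2, len=4, dim=3).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0],   # last token is padding
                 [1, 1, 0, 0]])  # last two tokens are padding
sentence_embs = mean_pool(hidden, mask)
print(cosine_similarity(sentence_embs[0], sentence_embs[1]))
```

In practice the hidden states would come from the model itself; the pooling and similarity steps are the same regardless of sequence length.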
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's ability to handle 8192-token sequences while maintaining high performance sets it apart, making it particularly suitable for long-document processing and RAG applications. According to LlamaIndex, it achieves top-tier performance when combined with rerankers.
**Q: What are the recommended use cases?**
The model excels in document retrieval, semantic search, and RAG applications. It's particularly effective for scenarios requiring long text understanding and comparison, such as academic paper analysis, legal document processing, or technical documentation search.