# Jina Embeddings V2 Base (English)
| Property | Value |
|---|---|
| Parameter Count | 137M |
| Max Sequence Length | 8192 tokens |
| License | Apache 2.0 |
| Paper | Technical Report |
| Architecture | BERT with ALiBi positioning |
## What is jina-embeddings-v2-base-en?
Jina Embeddings V2 Base is an English text embedding model that uses ALiBi (Attention with Linear Biases) positioning to handle sequences of up to 8192 tokens. Trained on over 400 million curated sentence pairs, it offers a strong balance between computational efficiency and embedding quality.
## Implementation Details
The model uses a modified BERT architecture with symmetric bidirectional ALiBi positioning, pretrained on the C4 dataset and then fine-tuned on a large collection of curated sentence pairs. Although trained at a 512-token sequence length, it extrapolates to 8K tokens at inference because ALiBi's attention bias depends only on relative token distance, not on learned absolute position embeddings.
- 137M parameters optimized for production deployment
- Supports sequence lengths up to 8192 tokens
- Trained on 400M+ carefully selected sentence pairs
- Implements symmetric bidirectional ALiBi positioning
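The symmetric bidirectional ALiBi scheme can be illustrated with a short sketch. This shows the general idea rather than the model's exact implementation; `alibi_slopes` and `symmetric_alibi_bias` are hypothetical names for illustration.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric slope sequence from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    (assumes n_heads is a power of two, as in the original formulation)."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def symmetric_alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Additive attention bias -m * |i - j| per head, shape (H, L, L).

    Using the symmetric distance |i - j| instead of the causal offset of
    decoder-only ALiBi makes the bias bidirectional.  Because it depends
    only on token distance, not absolute position, the same formula applies
    at any length -- the basis of the 512 -> 8K extrapolation."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])            # (L, L)
    return -alibi_slopes(n_heads)[:, None, None] * distance   # (H, L, L)

bias = symmetric_alibi_bias(seq_len=6, n_heads=8)
print(bias.shape)   # → (8, 6, 6)
```

Each head penalizes distant token pairs at a different rate, so some heads attend locally and others globally, with no retraining needed for longer inputs.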
## Core Capabilities
- Long document retrieval and processing
- Semantic textual similarity analysis
- Advanced text reranking
- RAG (Retrieval-Augmented Generation) applications
- LLM-based generative search
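Semantic similarity with an embedding model reduces to pooling token vectors into one sentence vector and comparing vectors by cosine similarity. A minimal, library-free sketch of the mask-aware mean pooling commonly used with BERT-style encoders, with toy arrays standing in for real model outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, counting only real (non-padding) tokens."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, L, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (B, 1)
    return summed / counts

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the encoder's last hidden states (batch=2, len=4, dim=3).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0],   # last token is padding
                 [1, 1, 0, 0]])  # last two tokens are padding
sentence_embs = mean_pool(hidden, mask)
print(cosine_similarity(sentence_embs[0], sentence_embs[1]))
```

In practice the hidden states would come from the model itself; the pooling and similarity steps are the same regardless of sequence length.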
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's ability to handle 8192-token sequences while maintaining high performance sets it apart, making it particularly suitable for long-document processing and RAG applications. According to LlamaIndex, it achieves top-tier performance when combined with rerankers.
**Q: What are the recommended use cases?**
The model excels in document retrieval, semantic search, and RAG applications. It's particularly effective for scenarios requiring long text understanding and comparison, such as academic paper analysis, legal document processing, or technical documentation search.