# snowflake-arctic-embed-l-v2.0-ko
| Property | Value |
|---|---|
| Model Type | Sentence Transformer |
| Output Dimensions | 1024 |
| Max Sequence Length | 8192 tokens |
| License | Apache-2.0 |
| Paper | Arctic-Embed 2.0: Multilingual Retrieval Without Compromise |
## What is snowflake-arctic-embed-l-v2.0-ko?
This is a Korean-optimized sentence transformer that builds on Snowflake's arctic-embed architecture. It produces 1024-dimensional embeddings for Korean text and achieves state-of-the-art results across multiple Korean retrieval benchmarks. The model was further trained on Korean data while retaining strong multilingual capability.
## Implementation Details
The model pairs an XLM-RoBERTa backbone with CLS-token pooling and output normalization. It supports a maximum sequence length of 8192 tokens and uses clustering-based batch construction during training to improve embedding quality.
- Implements CLS token pooling with normalized outputs
- Uses batch construction that avoids duplicate examples within a batch
- Trained with BF16 precision and warmup_stable_decay learning rate schedule
- Optimized for both phrase-based and full-sentence queries
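Because the model emits L2-normalized CLS embeddings, cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch of that property, using random stand-in vectors rather than real model outputs:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Divide by the Euclidean norm so every vector has unit length,
    # mirroring the model's normalized CLS-pooled outputs.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for two 1024-dimensional embeddings the model would produce.
a = l2_normalize(rng.standard_normal(1024))
b = l2_normalize(rng.standard_normal(1024))

# Full cosine-similarity formula...
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
# ...collapses to the dot product once the vectors are unit-norm.
assert np.isclose(a @ b, cosine)
```

This is why downstream search over these embeddings can use fast dot-product (inner-product) indexes directly.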
## Core Capabilities
- Achieves SOTA performance across 7 major Korean retrieval benchmarks
- Handles diverse query formats and phrasing variations
- Optimized for Markdown table search and structured content
- Trains effectively with clustering-based sampling, without requiring mined hard negatives
- Strong cross-domain performance beyond Wikipedia-based tasks
## Frequently Asked Questions
**Q: What makes this model unique?**
The model combines the powerful arctic-embed architecture with specialized Korean language optimization, achieving superior performance across diverse retrieval tasks while maintaining strong multilingual capabilities. Its efficient clustering approach and ability to handle various query formats make it particularly versatile.
**Q: What are the recommended use cases?**
The model excels at semantic search, document retrieval, and similarity matching for Korean content. It is particularly effective when precise semantic understanding of both short queries and longer documents is required, though it performs best on documents under 1,300 tokens.
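Given the roughly 1,300-token sweet spot noted above, longer documents are typically split into overlapping chunks before embedding. A simple sketch, using whitespace splitting as a crude token count (a real pipeline would count with the model's own tokenizer; the function name and overlap size here are illustrative choices, not part of the model):

```python
def chunk_text(text: str, max_tokens: int = 1300, overlap: int = 100) -> list[str]:
    # Crude whitespace "tokens"; a production pipeline would use the
    # model's actual tokenizer to measure length instead.
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [" ".join(tokens)]
    chunks = []
    step = max_tokens - overlap  # consecutive chunks share `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

long_doc = ("word " * 3000).strip()   # a 3000-"token" document
chunks = chunk_text(long_doc)
```

Each chunk is then embedded separately, and chunk-level scores are aggregated (e.g. max over chunks) at query time.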