GLuCoSE-base-ja-v2
Property | Value |
---|---|
Parameter Count | 133M |
License | Apache-2.0 |
Maximum Sequence Length | 512 tokens |
Output Dimensions | 768 |
Language | Japanese |
What is GLuCoSE-base-ja-v2?
GLuCoSE-base-ja-v2 is a Japanese text embedding model designed for retrieval tasks. Built on the original GLuCoSE model, it was fine-tuned in multiple stages that combine distillation from larger models with contrastive learning. Among models of similar size, it reports state-of-the-art results on a range of Japanese language tasks while remaining efficient to run.
Implementation Details
Training follows three steps: ensemble distillation from teacher models such as E5-mistral and gte-Qwen2, contrastive learning on multiple datasets, and search-specific optimization. Embeddings are compared with cosine similarity, and every input must start with a prefix: "query:" for search queries or "passage:" for the texts being retrieved.
- Optimized for efficient CPU inference
- Achieves 85.5% Recall@5 on the MIRACL benchmark
- Supports both SentenceTransformers and Transformers implementations
- Features 768-dimensional output embeddings
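As a rough illustration of the prefix requirement and the SentenceTransformers path, the sketch below prepends the prefix and encodes on CPU. It rests on several assumptions that are not stated above: the Hub id `pkshatech/GLuCoSE-base-ja-v2`, a single space after the colon in each prefix (the E5-style convention), and the helper names `add_prefix`, `embed`, and `demo`, which are hypothetical.

```python
from typing import List

# Assumed Hugging Face Hub id; the text above does not state it.
MODEL_ID = "pkshatech/GLuCoSE-base-ja-v2"


def add_prefix(texts: List[str], kind: str) -> List[str]:
    """Prepend 'query: ' or 'passage: ' to each text.

    The trailing space after the colon is an assumption (E5-style convention).
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


def embed(texts: List[str], kind: str):
    """Encode prefixed texts on CPU.

    Requires the sentence-transformers package and a model download,
    so the import is done lazily inside the function.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_ID, device="cpu")
    # With L2-normalized embeddings, a dot product equals cosine similarity.
    return model.encode(add_prefix(texts, kind), normalize_embeddings=True)


def demo() -> None:
    """Score two passages against one query (downloads the model when called)."""
    queries = embed(["日本で一番高い山は？"], "query")
    passages = embed(
        ["富士山は日本で最も高い山です。", "東京は日本の首都です。"], "passage"
    )
    # One row per query, one column per passage.
    print(queries @ passages.T)
```

Calling `demo()` prints a 1×2 matrix of cosine similarities. Loading the model once and reusing it across calls would be the more practical choice in a real service; it is reloaded here only to keep the sketch short.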
Core Capabilities
- High-performance text retrieval and semantic search
- Sentence similarity computation
- Document embedding and comparison
- Cross-lingual capability with Japanese focus
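To make the retrieval capability and the Recall@5 figure concrete, here is a minimal sketch of ranking passages by cosine similarity and scoring the ranking with Recall@k. The toy 3-dimensional vectors stand in for the model's 768-dimensional embeddings, and the function names `cosine_similarity`, `top_k`, and `recall_at_k` are hypothetical. Recall@k is taken here as the fraction of relevant passages that appear in the top k results, which may differ in detail from how the benchmark computes it.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k most similar passages."""
    scored = [
        (i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def recall_at_k(ranked_ids: Sequence[int], relevant_ids: Iterable[int], k: int = 5) -> float:
    """Fraction of relevant passages found among the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for i in ranked_ids[:k] if i in relevant)
    return hits / len(relevant)


# Toy data: one query and three passages in a 3-dimensional space.
query = [1.0, 0.0, 0.0]
passages = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

ranking = top_k(query, passages, k=2)
ranked_ids = [idx for idx, _ in ranking]
```

In this toy case passage 0 ranks first and passage 2 second, so if passages 0 and 1 were the relevant ones, `recall_at_k(ranked_ids, {0, 1}, k=2)` would be 0.5.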
Frequently Asked Questions
Q: What makes this model unique?
GLuCoSE-base-ja-v2 stands out for strong performance on Japanese language tasks at a relatively small parameter count (133M). It is competitive with larger models such as multilingual-e5-large (about 600M parameters) while being cheaper to deploy.
Q: What are the recommended use cases?
The model is suited to Japanese text retrieval, semantic search, and sentence similarity measurement. Because it runs efficiently on CPU, it fits production environments without GPUs, such as search engines, recommendation systems, and document comparison tools.