GLuCoSE-base-ja-v2

Property	Value
Parameter Count	133M
License	Apache-2.0
Maximum Sequence Length	512 tokens
Output Dimensions	768
Language	Japanese

What is GLuCoSE-base-ja-v2?

GLuCoSE-base-ja-v2 is a Japanese text embedding model built for retrieval tasks. It builds on the original GLuCoSE model and was fine-tuned in multiple stages that combine distillation from larger embedding models with contrastive learning. It is reported to achieve state-of-the-art results among similarly sized models on several Japanese tasks while remaining efficient to run.

Implementation Details

The model is trained in three steps: ensemble distillation from teacher models such as E5-mistral and gte-Qwen2, contrastive learning on multiple datasets, and search-specific optimization. Embeddings are compared with cosine similarity, and every input must carry a prefix: "query:" for search queries and "passage:" for the texts being searched.
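The prefix and cosine-similarity conventions described above can be sketched as follows. This is a minimal illustration, not the authors' reference code: the trailing space after each prefix, the helper names, and the model identifier passed to SentenceTransformers are assumptions that do not appear in this document.

```python
import math

# Assumption: each prefix is followed by a single space, as in E5-style models.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts, prefix):
    """Prepend the required prefix to every input string."""
    return [prefix + t for t in texts]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_name="pkshatech/GLuCoSE-base-ja-v2"):
    """Encode already-prefixed texts into lists of floats.

    Requires the sentence-transformers package and downloads the model on
    first use. The model identifier is an assumption, not stated in this text.
    """
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


# Inputs are prefixed before they are passed to the model.
queries = with_prefix(["日本の首都はどこですか"], QUERY_PREFIX)
passages = with_prefix(
    ["東京は日本の首都です。", "富士山は日本で最も高い山です。"],
    PASSAGE_PREFIX,
)
```

With the model installed, `embed(queries)` and `embed(passages)` would return the vectors, and `cosine_similarity` could then score each query against each passage.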

Designed to run efficiently on CPU
Achieves 85.5% Recall@5 on the MIRACL benchmark
Usable through either the SentenceTransformers or the Transformers library
Produces 768-dimensional output embeddings
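To make the Recall@5 figure above concrete, here is a small sketch of how Recall@k is commonly computed. It assumes the definition "fraction of a query's relevant documents that appear in the top k results, averaged over queries"; the exact MIRACL evaluation protocol may differ, and the function names and toy document IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant documents: define recall as 0 for this query
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


# Toy data, not taken from any benchmark.
runs = [
    # "d1" is in the top 5, "d4" is at rank 6, so recall is 1/2.
    (["d3", "d1", "d7", "d2", "d9", "d4"], ["d1", "d4"]),
    # The only relevant document is ranked first, so recall is 1/1.
    (["d5", "d8", "d6", "d0", "d2"], ["d5"]),
]
score = mean_recall_at_k(runs, k=5)  # (0.5 + 1.0) / 2 = 0.75
```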

Core Capabilities

High-performance text retrieval and semantic search
Sentence similarity computation
Document embedding and comparison
Cross-lingual capability with Japanese focus

Frequently Asked Questions

Q: What makes this model unique?

GLuCoSE-base-ja-v2 combines strong results on Japanese tasks with a relatively small parameter count (133M). It is competitive with larger models such as multilingual-e5-large (roughly 0.6B parameters) while being cheaper to deploy.

Q: What are the recommended use cases?

The model is suited to Japanese text retrieval, semantic search, and sentence similarity measurement. It fits production environments that require CPU inference, for example search engines, recommendation systems, and document comparison tools.
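As a rough illustration of the semantic search use case, the sketch below ranks documents against a query by cosine similarity and returns the top k. The 3-dimensional vectors are toy stand-ins for the model's 768-dimensional embeddings, and the function names and document IDs are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # For unit vectors, the dot product equals the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Toy embeddings; real ones would come from the model.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.1],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
```

A linear scan like this is fine for small collections; larger corpora would typically use an approximate nearest-neighbor index instead.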