DEk21_hcmute_embedding
| Property | Value |
|---|---|
| Model Type | Sentence Transformer |
| Language | Vietnamese |
| Dimensions | 768 |
| Max Sequence Length | 512 tokens |
| License | Apache-2.0 |
What is DEk21_hcmute_embedding?
DEk21_hcmute_embedding is a Vietnamese text embedding model built for legal-domain applications. Developed by huyydangg, it is trained with Matryoshka loss, so its embeddings can be truncated to fewer dimensions while keeping most of their retrieval performance. The training data consists of 100,000 legal questions paired with their contexts, which makes the model well suited to legal information retrieval tasks.
Implementation Details
The model is built on the Sentence Transformer architecture, utilizing a RoBERTa-based backbone with mean pooling. It produces 768-dimensional embeddings that can be dynamically truncated to smaller dimensions (512, 256, 128, or 64) with minimal performance degradation, thanks to its Matryoshka training approach.
- Architecture: RoBERTa-based Sentence Transformer
- Pooling Strategy: Mean tokens pooling
- Similarity Metric: Cosine Similarity
- Training Innovation: Matryoshka loss for dimensional flexibility
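The truncation described above can be sketched in a few lines. This is a minimal illustration, not the model's own code: the helper names `truncate_and_normalize` and `cosine_similarity` are hypothetical, and the 8-dimensional toy vectors stand in for real 768-dimensional outputs. The idea is to keep the leading components of an embedding and rescale the result to unit length before comparing with cosine similarity.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components of `vec` and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional vectors standing in for 768-dimensional model outputs.
query = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.1]
doc = [0.8, 0.2, 0.4, 0.1, 0.0, 0.3, 0.1, 0.0]

# Compare the same pair at progressively smaller sizes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine_similarity(q, d), 4))
```

With a real model the same slicing would use the sizes listed above (512, 256, 128, or 64). Query and document vectors must be truncated to the same size before they are compared.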
Core Capabilities
- Vietnamese legal document retrieval: 0.8112 MAP at the full 768 dimensions
- Dimensional reduction: 0.7718 MAP at 64 dimensions, roughly 95% of the full-dimension score
- Embedding size selectable per deployment (768, 512, 256, 128, or 64 dimensions)
- Specialized in legal question-answering and document matching
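For readers unfamiliar with the metric, the sketch below shows one common way to compute MAP (mean average precision) over ranked results, and then the share of the reported score that survives at 64 dimensions. The text above does not state the evaluation protocol (rank cutoff, relevance labels), so the function names and toy runs here are illustrative assumptions, not the author's evaluation code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalize by the total number of relevant items.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries: the relevant document is ranked 1st, then 2nd.
runs = [
    (["d1", "d2", "d3"], ["d1"]),  # average precision 1.0
    (["d4", "d5", "d6"], ["d5"]),  # average precision 0.5
]
map_score = mean_average_precision(runs)
print(map_score)  # 0.75

# Share of the reported full-dimension MAP retained at 64 dimensions.
retained = 0.7718 / 0.8112
print(f"{retained:.1%}")  # 95.1%
```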
Frequently Asked Questions
Q: What makes this model unique?
Training with Matryoshka loss lets the embeddings be truncated to smaller dimensions while retaining most of their retrieval performance, which suits production environments where storage and compute are constrained. Its training on legal question-context pairs also tailors it to legal-domain applications.
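To make the efficiency argument concrete, here is a rough back-of-the-envelope calculation of raw index size at each embedding size. It assumes vectors are stored as uncompressed 32-bit floats with no index overhead, and the corpus size of one million vectors is a hypothetical figure chosen only for illustration.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_vectors` dense vectors of `dim` values each."""
    return num_vectors * dim * bytes_per_value


NUM_VECTORS = 1_000_000  # hypothetical corpus size

for dim in (768, 512, 256, 128, 64):
    size_mb = index_size_bytes(NUM_VECTORS, dim) / 1e6
    print(f"{dim:>3} dims: {size_mb:,.0f} MB")
```

Under these assumptions, moving from 768 to 64 dimensions cuts raw storage by a factor of 12 (3,072 MB down to 256 MB). The cost of each dot product shrinks by the same factor.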
Q: What are the recommended use cases?
The model excels in legal document retrieval, question-answering systems, and RAG applications focused on Vietnamese legal content. It's particularly effective for building search systems, document matching, and semantic similarity tasks in the legal domain.
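A retrieval step for such a system might look like the sketch below. Several parts are assumptions rather than facts from this page: the Hub ID in `MODEL_ID` is inferred from the author and model name and should be verified, the `embed` and `top_k` helpers are hypothetical, and any Vietnamese-specific preprocessing the model may expect is not covered. The `embed` function relies on the `sentence-transformers` package (its `truncate_dim` constructor argument and `normalize_embeddings` encode option). The import sits inside the function so the ranking logic can be tried on toy vectors without downloading anything.

```python
MODEL_ID = "huyydangg/DEk21_hcmute_embedding"  # assumed Hub ID; verify before use


def embed(texts, dim=256):
    """Encode texts with the model, truncated to `dim` dimensions.

    Needs the sentence-transformers package and a model download,
    so the import is deferred until the function is called.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_ID, truncate_dim=dim)
    return model.encode(texts, normalize_embeddings=True).tolist()


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot products.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, vec)))
        for i, vec in enumerate(doc_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy unit vectors standing in for embedded legal passages and a question.
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]

results = top_k(query, docs, k=2)
print([index for index, _ in results])  # [1, 0]
```

In a real pipeline, `docs` would come from `embed(passages)` computed once and stored, and `query` from `embed([question])[0]` at request time, both at the same `dim`.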