DEk21_hcmute_embedding

Property	Value
Model Type	Sentence Transformer
Language	Vietnamese
Dimensions	768
Max Sequence Length	512 tokens
License	Apache-2.0

What is DEk21_hcmute_embedding?

DEk21_hcmute_embedding is a Vietnamese text embedding model built for legal-domain applications. Developed by huyydangg, it is trained with Matryoshka loss, so its embeddings can be truncated to smaller dimensions with little loss in retrieval quality. The training data consists of 100,000 legal questions paired with their contexts, which makes the model well suited to legal information retrieval.

Implementation Details

The model follows the Sentence Transformer architecture, with a RoBERTa-based backbone and mean pooling. It produces 768-dimensional embeddings that can be truncated to 512, 256, 128, or 64 dimensions. Because of the Matryoshka training approach, the cost of truncation is small: the reported MAP falls from 0.8112 at 768 dimensions to 0.7718 at 64.

Architecture: RoBERTa-based Sentence Transformer
Pooling Strategy: Mean tokens pooling
Similarity Metric: Cosine Similarity
Training Innovation: Matryoshka loss for dimensional flexibility
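
The pooling, truncation, and similarity steps listed above can be illustrated with a minimal sketch in plain Python. It is not the model's actual implementation: the function names are hypothetical, the token vectors are toy values rather than real RoBERTa outputs, and renormalizing after truncation is an assumption that is common practice with Matryoshka-style embeddings.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return head
    return [v / norm for v in head]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example: three 4-dimensional token vectors, the last one is padding.
toy_tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [0.0, 0.0, 0.0, 0.0]]
toy_mask = [1, 1, 0]
sentence_vec = mean_pool(toy_tokens, toy_mask)
short_vec = truncate_and_normalize(sentence_vec, 2)
print(cosine_similarity(short_vec, short_vec))
```

With the real model, the same truncation would apply to the 768-dimensional output, keeping the leading 512, 256, 128, or 64 components.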

Core Capabilities

Vietnamese legal document retrieval: 0.8112 MAP at the full 768 dimensions
Dimensional reduction with limited quality loss: 0.7718 MAP retained at 64 dimensions
Optimized for production deployment with flexible embedding sizes
Specialized in legal question-answering and document matching
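
For readers unfamiliar with the MAP figures quoted above, the following sketch shows one common way to compute mean average precision. It is an illustration only: the source does not state the rank cutoff or the exact variant used in the evaluation, so this version scores the full ranked list and divides by the total number of relevant documents. The function names and document IDs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average precision over all queries.

    rankings:  dict mapping query id -> list of doc ids, best first
    relevance: dict mapping query id -> collection of relevant doc ids
    """
    scores = [
        average_precision(ranked, relevance.get(query_id, ()))
        for query_id, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with two queries.
toy_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
toy_relevance = {"q1": {"a", "c"}, "q2": {"y"}}
print(mean_average_precision(toy_rankings, toy_relevance))
```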

Frequently Asked Questions

Q: What makes this model unique?

Training with Matryoshka loss allows embeddings to be truncated to a smaller dimension while keeping most of the retrieval quality, which suits production environments where storage and compute are constrained. In addition, the model is trained specifically on Vietnamese legal content, so it targets legal-domain applications directly.

Q: What are the recommended use cases?

The model is recommended for legal document retrieval, question-answering systems, and retrieval-augmented generation (RAG) applications over Vietnamese legal content. Typical uses are search systems, document matching, and semantic similarity tasks in the legal domain.
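
The retrieval step behind these use cases can be sketched as a cosine-similarity ranking over precomputed embeddings. This is a hedged illustration, not part of the model's tooling: `top_k` and `normalize` are hypothetical helpers, and the two-dimensional vectors stand in for real embeddings that would be produced by the model itself.

```python
import heapq
import math


def normalize(vec):
    """Rescale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def top_k(query_vec, corpus, k=3):
    """Return the k documents most similar to the query, best first.

    corpus: dict mapping doc id -> embedding (same length as query_vec)
    Result: list of (doc_id, cosine_similarity) pairs.
    """
    query = normalize(query_vec)
    scored = (
        (sum(q * d for q, d in zip(query, normalize(doc_vec))), doc_id)
        for doc_id, doc_vec in corpus.items()
    )
    best = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in best]


# Toy corpus; a real system would store model embeddings, optionally truncated.
toy_corpus = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
for doc_id, score in top_k([1.0, 0.1], toy_corpus, k=2):
    print(doc_id, round(score, 3))
```

In a RAG pipeline, the top-ranked passages would then be passed to the generator as context. For larger corpora, an approximate nearest-neighbor index would normally replace this linear scan.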