DEk21_hcmute_embedding

Maintained by: huyydangg

Model Type: Sentence Transformer
Language: Vietnamese
Dimensions: 768
Max Sequence Length: 512 tokens
License: Apache-2.0

What is DEk21_hcmute_embedding?

DEk21_hcmute_embedding is a Vietnamese text embedding model built for legal-domain applications. Developed by huyydangg, it is trained with Matryoshka loss, which lets its embeddings be truncated to smaller dimensions while retaining most of their retrieval performance. Training used a dataset of 100,000 legal questions paired with their contexts, which makes the model particularly effective for legal information retrieval tasks.

Implementation Details

The model is built on the Sentence Transformer architecture, utilizing a RoBERTa-based backbone with mean pooling. It produces 768-dimensional embeddings that can be dynamically truncated to smaller dimensions (512, 256, 128, or 64) with minimal performance degradation, thanks to its Matryoshka training approach.

  • Architecture: RoBERTa-based Sentence Transformer
  • Pooling Strategy: Mean tokens pooling
  • Similarity Metric: Cosine Similarity
  • Training Innovation: Matryoshka loss for dimensional flexibility
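
The truncation step described above can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes embeddings arrive as a NumPy array, keeps the leading components, and re-normalizes so that cosine similarity stays well defined. The helper names `truncate_embeddings` and `cosine_similarity` are hypothetical, and the random vectors are stand-ins for real model output, so the printed scores carry no meaning beyond showing the shapes.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each vector, then L2-normalize."""
    if not 0 < dim <= embeddings.shape[-1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[-1]}, got {dim}")
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero vector.
    return truncated / np.clip(norms, 1e-12, None)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in data (assumption): random vectors in place of real 768-d model output.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))

for dim in (768, 512, 256, 128, 64):
    small = truncate_embeddings(full, dim)
    score = cosine_similarity(small[0], small[1])
    print(dim, small.shape, round(score, 4))
```

Re-normalizing after truncation matters because the leading components of a unit vector no longer have unit length on their own.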

Core Capabilities

  • Strong Vietnamese legal document retrieval (0.8112 MAP at the full 768 dimensions)
  • Graceful dimensionality reduction (MAP is still 0.7718 at 64 dimensions)
  • Flexible embedding sizes, so deployments can trade a little accuracy for lower storage and compute
  • Specialized in legal question-answering and document matching
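
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how mean average precision (MAP) is commonly computed for ranked retrieval results. The source does not state the evaluation protocol (for example, a rank cutoff), so this is the textbook definition rather than the exact procedure behind the reported scores. The function names and document IDs are illustrative.

```python
from typing import Iterable, Sequence


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Average of precision@rank taken at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by the number of relevant documents penalizes ones never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance) -> float:
    """Mean of per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): two queries with their ranked results.
rankings = [["d1", "d2", "d3", "d4"], ["d5", "d6"]]
relevance = [{"d1", "d3"}, {"d6"}]
map_score = mean_average_precision(rankings, relevance)
print(round(map_score, 4))
```

In the toy data, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, and MAP is the mean of the two.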

Frequently Asked Questions

Q: What makes this model unique?

Its Matryoshka training allows embeddings to be truncated to smaller dimensions while keeping most of their retrieval performance, which suits production environments where storage and compute budgets are tight. Its training on legal content also makes it well suited to legal-domain applications.

Q: What are the recommended use cases?

The model excels in legal document retrieval, question-answering systems, and RAG applications focused on Vietnamese legal content. It's particularly effective for building search systems, document matching, and semantic similarity tasks in the legal domain.
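
A search system of the kind described above usually reduces to ranking document vectors by cosine similarity to a query vector. The following is a minimal sketch under stated assumptions: embeddings are already computed and held in NumPy arrays, and the tiny 3-dimensional vectors are stand-ins for real model output. The helper `top_k` is a hypothetical name, not part of any library.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (document index, cosine score) pairs for the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q  # dot product of unit vectors equals cosine similarity
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Stand-in vectors (assumption): real inputs would be the model's embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

results = top_k(query, docs, k=2)
for index, score in results:
    print(index, round(score, 4))
```

For large collections, an approximate nearest-neighbor index would typically replace the brute-force matrix product, and truncated embeddings reduce the memory such an index needs.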
