vietnamese-document-embedding

Maintained by: dangvantuan

Property          Value
Author            dangvantuan
Context Length    8096 tokens
Model Base        gte-multilingual
Hugging Face      Model Repository

What is vietnamese-document-embedding?

vietnamese-document-embedding is a document embedding model built specifically for the Vietnamese language. It is based on the gte-multilingual architecture and generates embeddings for long Vietnamese texts, with a context length of up to 8096 tokens. Training combines three objectives, Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss, with the aim of strong performance on Vietnamese semantic similarity tasks.
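
To make the first of these objectives more concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss, which is the idea usually meant by a multiple negatives ranking loss. It is an illustration only, not the author's training code: the function names, the toy vectors, and the scale value of 20 are assumptions chosen for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) pairs.

    For anchor i, positives[i] is the correct match and every other
    positives[j] in the batch is treated as a negative.
    The scale factor is an assumed temperature, not a value from the source.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true positive.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch: each anchor points in the same direction as its own positive.
aligned_loss = multiple_negatives_ranking_loss(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
# Same batch with the positives swapped, so every pair is mismatched.
swapped_loss = multiple_negatives_ranking_loss(
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
)
print(aligned_loss < swapped_loss)  # True
```

The loss is close to zero when each anchor is most similar to its own positive and grows when other items in the batch score higher, which is what pushes matching texts together during training.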

Implementation Details

The architecture combines a Transformer encoder with custom pooling and normalization layers. Training occurred in multiple stages, including NLI training on XNLI-vn and fine-tuning on the STSB-vn benchmark. Across the STS benchmarks evaluated, the model achieved a mean Spearman correlation of 82.45%.
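
For readers unfamiliar with the metric, the sketch below shows how a Spearman correlation between gold similarity labels and model scores can be computed. It is a plain-Python illustration under stated assumptions: the helper names are hypothetical and the numbers are made-up toy values, not results from this model.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: human similarity labels and hypothetical model cosine scores.
gold_scores = [0.0, 1.5, 3.0, 4.5, 5.0]
model_scores = [0.10, 0.35, 0.30, 0.80, 0.95]
print(round(spearman(gold_scores, model_scores), 4))  # 0.9
```

A mean score across benchmarks is then just the average of the per-benchmark correlations, typically reported multiplied by 100.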

  • Custom pooling that uses the CLS token
  • Trained with multiple loss functions (Multi-Negative Ranking Loss, Matryoshka2dLoss, SimilarityLoss)
  • Comprehensive evaluation across multiple STS benchmarks
  • Supports both semantic similarity and document embedding tasks
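
The CLS pooling and normalization steps mentioned above can be sketched as follows. This is a simplified stand-in written in plain Python, assuming the encoder has already produced one vector per token with the CLS token in first position; the function names and toy numbers are hypothetical and do not come from the model's code.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the CLS position) as the document vector."""
    return list(token_embeddings[0])


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def embed_from_tokens(token_embeddings):
    """Pooling followed by normalization, as in the pipeline described above."""
    return l2_normalize(cls_pool(token_embeddings))


# Toy encoder output: three tokens, two dimensions each. Row 0 is the CLS token.
token_embeddings = [
    [3.0, 4.0],
    [10.0, -2.0],
    [0.5, 0.5],
]
doc_vector = embed_from_tokens(token_embeddings)
print(doc_vector)  # [0.6, 0.8]
```

Because the output has unit length, the dot product of two such vectors equals their cosine similarity, which keeps downstream similarity code simple.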

Core Capabilities

  • Long document embedding up to 8096 tokens
  • Specialized Vietnamese language understanding
  • High performance on semantic textual similarity tasks
  • Easy integration with the sentence-transformers library
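
A minimal sketch of that integration is shown below. Several details are assumptions rather than facts from this page: the repository id dangvantuan/vietnamese-document-embedding, and the need for trust_remote_code=True (gte-multilingual models generally ship custom modeling code). The helper names are hypothetical, and the library is imported lazily so the helpers can be defined without downloading anything.

```python
def load_model(model_id="dangvantuan/vietnamese-document-embedding"):
    """Load the model with sentence-transformers.

    The default model_id and the trust_remote_code flag are assumptions;
    check the Hugging Face model repository before relying on them.
    """
    # Imported here so that defining these helpers does not require the package.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id, trust_remote_code=True)


def embed_documents(model, texts):
    """Encode texts into unit-length vectors, one row per input text."""
    return model.encode(texts, normalize_embeddings=True)


def pairwise_scores(embeddings):
    """Similarity matrix; for unit-length rows the dot product is the cosine."""
    return [
        [float(sum(a * b for a, b in zip(u, v))) for v in embeddings]
        for u in embeddings
    ]
```

With the package installed and network access available, calling load_model(), then embed_documents(model, texts) on a list of Vietnamese strings, then pairwise_scores(...) on the result would give a matrix of cosine similarities between the texts.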

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its focus on Vietnamese and its context length of 8096 tokens, which makes it well suited to long document processing. Its multi-stage training process and combination of loss functions (Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss) are credited with its strong results on Vietnamese semantic similarity benchmarks.

Q: What are the recommended use cases?

The model is particularly well-suited for document similarity comparison, semantic search, clustering of Vietnamese documents, and any NLP task requiring high-quality Vietnamese text embeddings. It's especially valuable for applications dealing with longer texts due to its extended context length.
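
As an illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query. It assumes the embeddings have already been computed by the model; the three-dimensional vectors are toy placeholders, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_vector, document_vectors, top_k=3):
    """Return up to top_k (document_index, score) pairs, best match first."""
    scored = [
        (index, cosine(query_vector, vector))
        for index, vector in enumerate(document_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings of three documents and one query.
document_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

hits = search(query_vector, document_vectors, top_k=2)
print([index for index, _ in hits])  # [0, 1]
```

For large collections, an exhaustive scan like this is usually replaced by an approximate nearest-neighbor index, but the ranking principle is the same.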
