vietnamese-document-embedding

dangvantuan

A Vietnamese document embedding model with an 8096-token context window, trained on the XNLI-vn and STSB-vn datasets, that reaches a mean Spearman score of 82.45% across STS benchmarks.

Property	Value
Author	dangvantuan
Context Length	8096 tokens
Model Base	gte-multilingual
Hugging Face	Model Repository

What is vietnamese-document-embedding?

vietnamese-document-embedding is a document embedding model built specifically for the Vietnamese language. It is based on the gte-multilingual architecture and generates embeddings for long Vietnamese texts of up to 8096 tokens. Training combines Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss, and the resulting model reaches a mean Spearman score of 82.45% across STS benchmarks.
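To make the first of these losses concrete, here is a minimal pure-Python sketch of the in-batch-negatives idea behind Multi-Negative Ranking Loss (the sentence-transformers library exposes a loss of this kind as MultipleNegativesRankingLoss). The function names and the scale value of 20 are illustrative assumptions, not details confirmed for this model's training run.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) pairs.

    positives[i] is the correct match for anchors[i]; every other
    positives[j] in the batch serves as a negative for anchors[i].
    The scale factor is an assumed illustrative value.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability assigned to the true positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

The loss is close to zero when each anchor is most similar to its own positive, and it grows when another item in the batch scores higher.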

Implementation Details

The model combines a Transformer with custom pooling and normalization layers. Training occurred in multiple stages, including NLI training on XNLI-vn and fine-tuning on the STSB-vn benchmark. Across the STS benchmarks used for evaluation, the model reaches a mean Spearman score of 82.45%.
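As a rough illustration of what CLS-style pooling followed by normalization does to the Transformer's per-token outputs, consider the sketch below. It assumes the pooled vector is simply the first token's hidden state and that normalization is plain L2 scaling; the function names are hypothetical and the model's actual layers may differ.

```python
import math


def cls_pool(token_embeddings):
    """Return the first token's vector (the CLS position) as the sequence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def embed_from_tokens(token_embeddings):
    """Pool per-token vectors into one unit-length document embedding."""
    return l2_normalize(cls_pool(token_embeddings))
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which keeps downstream comparison simple.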

Custom pooling with CLS token focus
Trained with multiple loss functions (Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss)
Comprehensive evaluation across multiple STS benchmarks
Supports both semantic similarity and document embedding tasks
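For readers unfamiliar with the metric, a mean Spearman score over several STS benchmarks could be computed along the following lines. This is a minimal sketch: the benchmark names and numbers in the usage note are made up for illustration, and the helper names are hypothetical rather than taken from the model's evaluation code.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for this tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))


def mean_spearman_percent(benchmarks):
    """Average Spearman over benchmarks, reported on a 0-100 scale.

    benchmarks maps a name to (predicted_similarities, gold_scores).
    """
    scores = [spearman(pred, gold) for pred, gold in benchmarks.values()]
    return 100 * sum(scores) / len(scores)
```

For example, calling mean_spearman_percent({"sts_a": (cosine_scores_a, gold_a), "sts_b": (cosine_scores_b, gold_b)}) averages the per-benchmark correlations, where the predicted values are cosine similarities between the embeddings of each sentence pair.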

Core Capabilities

Long document embedding up to 8096 tokens
Specialized Vietnamese language understanding
High performance on semantic textual similarity tasks
Easy integration with the sentence-transformers library
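A minimal sketch of that integration is shown below. It assumes the model is published under the repository id dangvantuan/vietnamese-document-embedding (inferred from the author and model name) and that loading requires trust_remote_code=True because gte-multilingual-based models ship custom modeling code. Both are assumptions to verify against the model repository, and the helper names are hypothetical.

```python
def load_model(model_id="dangvantuan/vietnamese-document-embedding"):
    """Load the embedding model; model_id is an assumed repository id."""
    # Imported lazily so embed_documents() can be used with any object
    # that offers a compatible encode() method.
    from sentence_transformers import SentenceTransformer

    # trust_remote_code=True is an assumption (custom model code in the repo).
    return SentenceTransformer(model_id, trust_remote_code=True)


def embed_documents(model, documents, batch_size=8):
    """Encode documents into L2-normalized embedding vectors."""
    return model.encode(
        documents,
        batch_size=batch_size,
        normalize_embeddings=True,
    )


# Example usage (downloads the model weights on first run):
# model = load_model()
# vectors = embed_documents(model, [
#     "Hà Nội là thủ đô của Việt Nam.",
#     "Thủ đô của Việt Nam là Hà Nội.",
# ])
# print(vectors.shape)
```

A small batch size is used here because long documents consume considerably more memory per item than short sentences; adjust it to the available hardware.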

Frequently Asked Questions

Q: What makes this model unique?

The model focuses specifically on Vietnamese and accepts inputs of up to 8096 tokens, which makes it well suited to long documents. Its multi-stage training process and combination of loss functions yield a mean Spearman score of 82.45% across STS benchmarks.

Q: What are the recommended use cases?

The model is particularly well-suited for document similarity comparison, semantic search, clustering of Vietnamese documents, and any NLP task requiring high-quality Vietnamese text embeddings. It's especially valuable for applications dealing with longer texts due to its extended context length.
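As one illustration of the semantic search use case, the sketch below ranks documents against a query using cosine similarity over vectors that have already been computed. The function names are hypothetical, and the vectors in any real application would come from the embedding model rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vector, document_vectors, top_k=3):
    """Return up to top_k (document_index, score) pairs, best match first."""
    scored = [
        (cosine_similarity(query_vector, vec), idx)
        for idx, vec in enumerate(document_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]
```

The same similarity function can feed a clustering step or a duplicate-detection threshold; for large collections, an approximate nearest-neighbor index would typically replace the linear scan.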