vietnamese-embedding

Maintained By
dangvantuan

  • Parameter Count: 135M
  • License: Apache 2.0
  • Primary Paper: SimCSE Paper
  • Tensor Type: F32

What is vietnamese-embedding?

vietnamese-embedding is a state-of-the-art sentence embedding model designed specifically for the Vietnamese language. Built on PhoBERT's RoBERTa architecture, it generates 768-dimensional vectors that capture the semantic meaning of Vietnamese text. The model outperforms other Vietnamese embedding models, reaching a score of 84.87% on the STSB benchmark.

Implementation Details

The model is trained with a four-stage process that uses the SimCSE approach with supervised contrastive learning. It employs a Transformer architecture with mean pooling and has been fine-tuned on multiple Vietnamese datasets, including ViNLI-SimCSE-supervised and XNLI-vn.

  • Pre-trained base: PhoBERT (RoBERTa architecture)
  • Embedding dimension: 768
  • Maximum sequence length: 512
  • Training methodology: Multi-stage fine-tuning with triplet loss

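The mean-pooling step mentioned above can be sketched in numpy. This is an illustrative stand-alone version, not the model's actual code: it averages token vectors while masking out padding, which is how a sentence-transformers mean-pooling head turns per-token encoder outputs into a single sentence vector. The toy dimension of 3 stands in for the model's real 768.

```python
import numpy as np

def mean_pooling(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, dim)
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)          # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)        # (dim,)
    count = max(float(mask.sum()), 1e-9)                  # avoid divide-by-zero
    return summed / count

# Toy input: 4 tokens (last one is padding), dim 3 for readability
tokens = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0],
                   [9.0, 9.0, 9.0]])   # padding row, must be ignored
mask = np.array([1, 1, 1, 0])
print(mean_pooling(tokens, mask))  # → [2. 2. 2.]
```

Note how the padding row contributes nothing to the average; without the mask it would badly skew the sentence vector.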
Core Capabilities

  • Semantic sentence similarity computation
  • High-quality Vietnamese text embeddings
  • Supports various NLP tasks including clustering and semantic search
  • Demonstrated superior performance across multiple STS benchmarks
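Semantic search with these embeddings typically means ranking documents by cosine similarity to an embedded query. A minimal sketch, using small hand-made vectors in place of the model's 768-dimensional outputs so the ranking is easy to verify:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Precomputed document embeddings (toy 2-d vectors; in practice, model outputs)
corpus = {
    "doc_a": np.array([1.0, 0.0]),
    "doc_b": np.array([0.7, 0.7]),
    "doc_c": np.array([0.0, 1.0]),
}
query = np.array([0.9, 0.1])  # embedding of the search query

# Rank documents from most to least similar to the query
ranked = sorted(corpus, key=lambda k: cosine_sim(query, corpus[k]), reverse=True)
print(ranked)  # doc_a points in nearly the same direction as the query
```

In a real deployment the same ranking is usually done with a single matrix multiplication over L2-normalized embeddings, or delegated to a vector index.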

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its four-stage training process specifically optimized for Vietnamese language understanding, resulting in state-of-the-art performance across multiple semantic textual similarity benchmarks.

Q: What are the recommended use cases?

The model is ideal for semantic search, text clustering, sentence similarity comparison, and other NLP tasks requiring deep semantic understanding of Vietnamese text.
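For the clustering use case, one simple approach is greedy clustering by cosine similarity: each new embedding joins the first cluster whose representative it is close enough to, otherwise it starts a new cluster. This is a hypothetical sketch with toy 2-d vectors and an assumed `threshold` hyperparameter; each cluster is represented by its first member rather than a running mean.

```python
import numpy as np

def cluster(vectors, threshold=0.9):
    """Greedy single-pass clustering by cosine similarity.

    Returns a cluster label for each input vector. A vector joins the
    first existing cluster whose representative has cosine similarity
    >= threshold; otherwise it founds a new cluster.
    """
    reps, labels = [], []
    for v in vectors:
        v = v / np.linalg.norm(v)  # unit-normalize so dot product = cosine
        for i, r in enumerate(reps):
            if float(v @ r) >= threshold:
                labels.append(i)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels

# Two near-duplicate pairs should yield two clusters
vecs = [np.array([1.0, 0.0]), np.array([0.99, 0.05]),
        np.array([0.0, 1.0]), np.array([0.05, 0.99])]
print(cluster(vecs))  # → [0, 0, 1, 1]
```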
