Vietnamese_Embedding
| Property | Value |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | BAAI/bge-m3 |
| Output Dimensions | 1024 |
| Max Sequence Length | 2048 tokens |
| Authors | Nguyễn Nho Trung, Nguyễn Nhật Quang |
| Model URL | huggingface.co/AITeamVN/Vietnamese_Embedding |
What is Vietnamese_Embedding?
Vietnamese_Embedding is an embedding model designed specifically for Vietnamese language processing. Fine-tuned from the BGE-M3 model, it was optimized on approximately 300,000 Vietnamese triplets of queries, positive documents, and negative documents. On Vietnamese text retrieval benchmarks it outperforms both the base BGE-M3 model and other Vietnamese bi-encoders.
Implementation Details
The model uses a sentence transformer architecture with dot product similarity scoring. It accepts input sequences of up to 2048 tokens and produces 1024-dimensional embeddings. Compared with the base BGE-M3 model, it shows clear gains on Vietnamese-specific retrieval tasks. A minimal usage sketch follows the list below.
- Trained on 300,000 Vietnamese text triplets
- Supports long sequences up to 2048 tokens
- Generates high-quality 1024-dimensional embeddings
- Uses efficient dot product similarity scoring
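The sketch below shows one way to load the model and score query–document pairs with dot product similarity. It is illustrative only: it assumes the checkpoint loads through the sentence-transformers library (the model card lists it as a Sentence Transformer), and the Vietnamese query and documents are made-up examples.

```python
from sentence_transformers import SentenceTransformer

# Assumed loading path via sentence-transformers; model id taken from the model card.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
model.max_seq_length = 2048  # matches the card's stated maximum sequence length

queries = ["Mức phạt khi vượt đèn đỏ là bao nhiêu?"]  # hypothetical example query
documents = [
    "Nghị định quy định mức xử phạt đối với hành vi không chấp hành tín hiệu đèn giao thông ...",
    "Hướng dẫn nấu phở bò truyền thống ...",
]

# 1024-dimensional dense embeddings
query_emb = model.encode(queries)
doc_emb = model.encode(documents)

# Dot-product similarity scoring, as described above
scores = query_emb @ doc_emb.T
print(scores)  # higher score = more relevant document
```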
Core Capabilities
- Achieves 72.74% Accuracy@1 on Legal Zalo 2021 dataset
- Outperforms Vietnamese-bi-encoder (BKAI) and BGE-M3 in benchmarks
- Excellent for Vietnamese text similarity and retrieval tasks
- Robust performance across retrieval metrics (MRR@10: 0.8181); the sketch below shows how these metrics are computed
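For reference, Accuracy@k and MRR@k are the standard retrieval metrics behind the numbers above. A minimal sketch of how they are computed, using toy ranks rather than the benchmark data:

```python
import numpy as np

def accuracy_at_k(first_relevant_ranks, k):
    """Fraction of queries whose first relevant document appears in the top k."""
    return float(np.mean([rank <= k for rank in first_relevant_ranks]))

def mrr_at_k(first_relevant_ranks, k):
    """Mean reciprocal rank, counting 0 when the relevant document falls outside the top k."""
    return float(np.mean([1.0 / rank if rank <= k else 0.0 for rank in first_relevant_ranks]))

# Toy example (not the benchmark data): 1-based ranks of the correct document for three queries
ranks = [1, 3, 12]
print(accuracy_at_k(ranks, 1))  # 0.333... -> Accuracy@1
print(mrr_at_k(ranks, 10))      # (1/1 + 1/3 + 0) / 3 ≈ 0.444 -> MRR@10
```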
Frequently Asked Questions
Q: What makes this model unique?
The model's unique strength lies in its specialized optimization for Vietnamese language processing, achieving state-of-the-art performance on Vietnamese text retrieval tasks while maintaining efficient computation through dot product similarity.
Q: What are the recommended use cases?
The model is ideal for Vietnamese language applications including semantic search, document similarity analysis, and information retrieval tasks. It's particularly effective for legal document processing, as demonstrated by its performance on the Legal Zalo 2021 dataset.
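As a rough illustration of the semantic search use case, the following sketch ranks a small made-up corpus against a query by dot-product score. The corpus, query, and top-k choice are assumptions for demonstration only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")

# Hypothetical mini-corpus of Vietnamese documents
corpus = [
    "Điều kiện cấp giấy phép lái xe hạng B2.",
    "Thủ tục đăng ký kết hôn với người nước ngoài.",
    "Quy định về thời giờ làm việc và thời giờ nghỉ ngơi.",
]
corpus_emb = model.encode(corpus)  # embed once, reuse for every query

query = "Làm việc tối đa bao nhiêu giờ một tuần?"
query_emb = model.encode([query])

# Rank the corpus by dot-product score and print the top results
scores = (query_emb @ corpus_emb.T)[0]
top_k = np.argsort(-scores)[:2]
for idx in top_k:
    print(f"{scores[idx]:.4f}  {corpus[idx]}")
```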