Vietnamese_Embedding

Maintained by: AITeamVN

Property: Value
Model Type: Sentence Transformer
Base Model: BAAI/bge-m3
Output Dimensions: 1024
Max Sequence Length: 2048 tokens
Authors: Nguyễn Nho Trung, Nguyễn Nhật Quang
Model URL: huggingface.co/AITeamVN/Vietnamese_Embedding

What is Vietnamese_Embedding?

Vietnamese_Embedding is an embedding model built specifically for Vietnamese text. It is fine-tuned from BGE-M3 on approximately 300,000 Vietnamese triplets, each consisting of a query, a positive document, and a negative document. In the reported benchmarks it outperforms both the base BGE-M3 model and Vietnamese-bi-encoder (BKAI) on Vietnamese text retrieval.

Implementation Details

The model uses a Sentence Transformer architecture and scores similarity with the dot product. It accepts input text of up to 2048 tokens and produces 1024-dimensional embeddings. In the reported benchmarks it improves on the base BGE-M3 model for Vietnamese retrieval.

  • Trained on 300,000 Vietnamese text triplets
  • Supports long sequences up to 2048 tokens
  • Generates high-quality 1024-dimensional embeddings
  • Uses efficient dot product similarity scoring
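
The embedding and dot product scoring described above can be sketched as follows. This is a minimal illustration, not the authors' reference code: it assumes the `sentence-transformers` package is installed and that the model loads under the ID shown in the Model URL. The helper names `embed` and `rank_by_dot_product` are hypothetical, and the toy vectors stand in for real 1024-dimensional embeddings.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(float(x) * float(y) for x, y in zip(a, b))


def rank_by_dot_product(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, highest dot product first."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(texts: List[str]):
    """Encode texts with Vietnamese_Embedding (hypothetical helper).

    Assumption: `sentence-transformers` is installed. The model weights are
    downloaded on the first call, so this function is defined but not run here.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
    model.max_seq_length = 2048  # maximum sequence length stated in the card
    return model.encode(texts)   # one 1024-dimensional vector per input text


# Toy 2-dimensional vectors used in place of real embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
toy_ranking = rank_by_dot_product(toy_query, toy_docs)
```

With real data, the same ranking function could be applied to the output of `embed`, for example `rank_by_dot_product(embed([query])[0], embed(documents))`, where `query` and `documents` are your own Vietnamese strings.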

Core Capabilities

  • Achieves 72.74% Accuracy@1 on the Legal Zalo 2021 dataset
  • Outperforms Vietnamese-bi-encoder (BKAI) and BGE-M3 in benchmarks
  • Excellent for Vietnamese text similarity and retrieval tasks
  • Reports an MRR@10 of 0.8181 alongside its accuracy metrics
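
For readers unfamiliar with the metrics quoted above, the sketch below computes Accuracy@k and MRR@k using their common definitions. The page does not describe the exact evaluation protocol behind the reported numbers, so treat this as an assumption about how such scores are typically calculated. The function names and the toy document IDs are illustrative only.

```python
from typing import Sequence, Set


def accuracy_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not ranked or len(ranked) != len(relevant):
        raise ValueError("need one non-empty ranking per relevance set")
    hits = sum(
        1
        for docs, rel in zip(ranked, relevant)
        if any(doc_id in rel for doc_id in docs[:k])
    )
    return hits / len(ranked)


def mrr_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Mean of 1/rank of the first relevant document within the top k.

    A query with no relevant document in the top k contributes 0.
    """
    if not ranked or len(ranked) != len(relevant):
        raise ValueError("need one non-empty ranking per relevance set")
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy example: three queries, each with a ranked list of retrieved IDs.
toy_ranked = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"]]
toy_relevant = [{"d1"}, {"d4"}, {"d0"}]
toy_acc1 = accuracy_at_k(toy_ranked, toy_relevant, k=1)
toy_mrr3 = mrr_at_k(toy_ranked, toy_relevant, k=3)
```

In the toy data, the first query finds its relevant document at rank 1, the second at rank 2, and the third not at all, which is why Accuracy@1 and MRR@3 differ.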

Frequently Asked Questions

Q: What makes this model unique?

The model's main strength is its fine-tuning for Vietnamese: in the reported benchmarks it outperforms BGE-M3 and Vietnamese-bi-encoder (BKAI) on Vietnamese text retrieval, while dot product similarity keeps scoring computationally cheap.

Q: What are the recommended use cases?

The model is ideal for Vietnamese language applications including semantic search, document similarity analysis, and information retrieval tasks. It's particularly effective for legal document processing, as demonstrated by its performance on the Legal Zalo 2021 dataset.
