Vietnamese_Embedding
| Property | Value |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | BAAI/bge-m3 |
| Output Dimensions | 1024 |
| Max Sequence Length | 2048 tokens |
| Authors | Nguyễn Nho Trung, Nguyễn Nhật Quang |
| Model URL | huggingface.co/AITeamVN/Vietnamese_Embedding |
What is Vietnamese_Embedding?
Vietnamese_Embedding is an embedding model designed specifically for Vietnamese language processing. Fine-tuned from the BGE-M3 model, it was optimized on approximately 300,000 Vietnamese triplets of queries, positive documents, and negative documents. On Vietnamese text retrieval benchmarks it outperforms both the base BGE-M3 model and other Vietnamese bi-encoders.
Implementation Details
The model uses a sentence transformer architecture with dot product similarity scoring. It accepts input sequences of up to 2048 tokens and produces 1024-dimensional embeddings. Compared with the base BGE-M3 model, it shows clear gains on Vietnamese-specific retrieval tasks. A minimal usage sketch follows the list below.
- Trained on 300,000 Vietnamese text triplets
- Supports long sequences up to 2048 tokens
- Generates high-quality 1024-dimensional embeddings
- Uses efficient dot product similarity scoring
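The sketch below shows one way to load the model and score query–document pairs with dot product similarity. It is illustrative only: it assumes the checkpoint loads through the sentence-transformers library (the model card lists it as a Sentence Transformer), and the Vietnamese query and documents are made-up examples.

```python
from sentence_transformers import SentenceTransformer

# Assumed loading path via sentence-transformers; model id taken from the model card.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
model.max_seq_length = 2048  # matches the card's stated maximum sequence length

queries = ["Mức phạt khi vượt đèn đỏ là bao nhiêu?"]  # hypothetical example query
documents = [
    "Nghị định quy định mức xử phạt đối với hành vi không chấp hành tín hiệu đèn giao thông ...",
    "Hướng dẫn nấu phở bò truyền thống ...",
]

# 1024-dimensional dense embeddings
query_emb = model.encode(queries)
doc_emb = model.encode(documents)

# Dot-product similarity scoring, as described above
scores = query_emb @ doc_emb.T
print(scores)  # higher score = more relevant document
```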
Core Capabilities
- Achieves 72.74% Accuracy@1 on Legal Zalo 2021 dataset
- Outperforms Vietnamese-bi-encoder (BKAI) and BGE-M3 in benchmarks
- Excellent for Vietnamese text similarity and retrieval tasks
- Robust performance across retrieval metrics (MRR@10: 0.8181); the sketch below shows how these metrics are computed
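For reference, Accuracy@k and MRR@k are the standard retrieval metrics behind the numbers above. A minimal sketch of how they are computed, using toy ranks rather than the benchmark data:

```python
import numpy as np

def accuracy_at_k(first_relevant_ranks, k):
    """Fraction of queries whose first relevant document appears in the top k."""
    return float(np.mean([rank <= k for rank in first_relevant_ranks]))

def mrr_at_k(first_relevant_ranks, k):
    """Mean reciprocal rank, counting 0 when the relevant document falls outside the top k."""
    return float(np.mean([1.0 / rank if rank <= k else 0.0 for rank in first_relevant_ranks]))

# Toy example (not the benchmark data): 1-based ranks of the correct document for three queries
ranks = [1, 3, 12]
print(accuracy_at_k(ranks, 1))  # 0.333... -> Accuracy@1
print(mrr_at_k(ranks, 10))      # (1/1 + 1/3 + 0) / 3 ≈ 0.444 -> MRR@10
```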
Frequently Asked Questions
Q: What makes this model unique?
The model's unique strength lies in its specialized optimization for Vietnamese language processing, achieving state-of-the-art performance on Vietnamese text retrieval tasks while maintaining efficient computation through dot product similarity.
Q: What are the recommended use cases?
The model is ideal for Vietnamese language applications including semantic search, document similarity analysis, and information retrieval tasks. It's particularly effective for legal document processing, as demonstrated by its performance on the Legal Zalo 2021 dataset.
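As a rough illustration of the semantic search use case, the following sketch ranks a small made-up corpus against a query by dot-product score. The corpus, query, and top-k choice are assumptions for demonstration only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")

# Hypothetical mini-corpus of Vietnamese documents
corpus = [
    "Điều kiện cấp giấy phép lái xe hạng B2.",
    "Thủ tục đăng ký kết hôn với người nước ngoài.",
    "Quy định về thời giờ làm việc và thời giờ nghỉ ngơi.",
]
corpus_emb = model.encode(corpus)  # embed once, reuse for every query

query = "Làm việc tối đa bao nhiêu giờ một tuần?"
query_emb = model.encode([query])

# Rank the corpus by dot-product score and print the top results
scores = (query_emb @ corpus_emb.T)[0]
top_k = np.argsort(-scores)[:2]
for idx in top_k:
    print(f"{scores[idx]:.4f}  {corpus[idx]}")
```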