# visobert-14gb-corpus
| Property | Value |
|---|---|
| Parameter Count | 97.6M |
| Model Type | Fill-Mask Transformer |
| Architecture | XLM-RoBERTa |
| Tensor Type | F32 |
## What is visobert-14gb-corpus?
visobert-14gb-corpus is a Vietnamese language model that builds upon the uitnlp/visobert architecture, pre-trained on a 14GB corpus comprising 100M Facebook comments, 15M Facebook posts, UIT data, and MC4 e-commerce content. The model achieves state-of-the-art performance across multiple Vietnamese social media text analysis tasks.
## Implementation Details
The model is served through the transformers library as a fill-mask pipeline. It was trained with the AdamW optimizer for 30 epochs with a maximum sequence length of 128 tokens; batch sizes and learning-rate schedules were tuned separately for each downstream task.
- Pre-trained on diverse Vietnamese social media content
- Achieves an 82.2% average Macro F1 score across evaluation tasks
- Optimized for emotion recognition, hate speech detection, spam detection, and hate speech spans detection
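Since the card describes the model as a fill-mask pipeline, usage would look roughly like the sketch below. This is a hedged example: the model id `"visobert-14gb-corpus"` is a placeholder (substitute the actual Hub repo path), and the `<mask>` token follows the XLM-RoBERTa convention the card's architecture implies.

```python
# Sketch of masked-token prediction with the Hugging Face transformers
# fill-mask pipeline. The model id below is an assumption, not the
# confirmed Hub path for this checkpoint.

def top_predictions(results, k=3):
    """Keep the k highest-scoring fill-mask candidates.

    `results` is the list of dicts the transformers fill-mask pipeline
    returns, each holding at least "token_str" and "score" keys.
    """
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]

if __name__ == "__main__":
    from transformers import pipeline  # pip install transformers

    # Placeholder repo id -- replace with the real one.
    fill = pipeline("fill-mask", model="visobert-14gb-corpus")

    # XLM-RoBERTa-style models use "<mask>" as the mask token.
    for cand in top_predictions(fill("Hôm nay trời <mask> quá!")):
        print(cand["token_str"], round(cand["score"], 4))
```

The pipeline call downloads the checkpoint on first use; the helper merely sorts the candidate list the pipeline already returns.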
## Core Capabilities
- Emotion Recognition: 68.69% accuracy
- Hate Speech Detection: 88.79% accuracy
- Spam Reviews Detection: 91.02% accuracy
- Hate Speech Spans Detection: 93.69% accuracy
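The card's headline number is an average Macro F1 score, which weights every class equally regardless of how often it occurs, an important property for imbalanced tasks like hate speech detection. As a self-contained illustration (not this model's evaluation code), Macro F1 can be computed as:

```python
# Sketch: unweighted (macro) average of per-class F1 scores.
# Pure Python, for illustration only; real evaluations typically
# use sklearn.metrics.f1_score(average="macro").

def macro_f1(y_true, y_pred, labels):
    """Average the F1 score of each class in `labels` with equal weight."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because each class contributes equally, a model that ignores a rare class (e.g. the "hate" label) is penalized even if overall accuracy stays high.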
## Frequently Asked Questions
Q: What makes this model unique?
The model distinguishes itself through pre-training on a diverse 14GB Vietnamese social media corpus and strong performance across social media analysis tasks, consistently outperforming predecessors such as PhoBERT and viBERT.
Q: What are the recommended use cases?
The model is particularly well-suited for Vietnamese social media text analysis, including emotion detection, content moderation, spam detection, and hate speech identification on social media platforms and in e-commerce applications.