# visobert-14gb-corpus
| Property | Value |
|---|---|
| Parameter Count | 97.6M |
| Model Type | Fill-Mask Transformer |
| Architecture | XLM-RoBERTa |
| Tensor Type | F32 |
## What is visobert-14gb-corpus?
visobert-14gb-corpus is a Vietnamese language model that builds upon the uitnlp/visobert architecture, pre-trained on a 14GB corpus comprising 100M Facebook comments, 15M Facebook posts, UIT data, and MC4 e-commerce content. The model achieves state-of-the-art performance across multiple Vietnamese social media text analysis tasks.
## Implementation Details
The model is served through the transformers library as a fill-mask pipeline. It was trained with the AdamW optimizer for 30 epochs with a maximum sequence length of 128 tokens; batch sizes and learning-rate schedules were tuned separately for each downstream task.
- Pre-trained on diverse Vietnamese social media content
- Achieves an 82.2% average Macro F1 score across evaluation tasks
- Optimized for emotion recognition, hate speech detection, spam detection, and hate speech spans detection
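Since the card describes the model as a fill-mask pipeline, usage would look roughly like the sketch below. This is a hedged example: the model id `"visobert-14gb-corpus"` is a placeholder (substitute the actual Hub repo path), and the `<mask>` token follows the XLM-RoBERTa convention the card's architecture implies.

```python
# Sketch of masked-token prediction with the Hugging Face transformers
# fill-mask pipeline. The model id below is an assumption, not the
# confirmed Hub path for this checkpoint.

def top_predictions(results, k=3):
    """Keep the k highest-scoring fill-mask candidates.

    `results` is the list of dicts the transformers fill-mask pipeline
    returns, each holding at least "token_str" and "score" keys.
    """
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]

if __name__ == "__main__":
    from transformers import pipeline  # pip install transformers

    # Placeholder repo id -- replace with the real one.
    fill = pipeline("fill-mask", model="visobert-14gb-corpus")

    # XLM-RoBERTa-style models use "<mask>" as the mask token.
    for cand in top_predictions(fill("Hôm nay trời <mask> quá!")):
        print(cand["token_str"], round(cand["score"], 4))
```

The pipeline call downloads the checkpoint on first use; the helper merely sorts the candidate list the pipeline already returns.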
## Core Capabilities
- Emotion Recognition: 68.69% accuracy
- Hate Speech Detection: 88.79% accuracy
- Spam Reviews Detection: 91.02% accuracy
- Hate Speech Spans Detection: 93.69% accuracy
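The card's headline number is an average Macro F1 score, which weights every class equally regardless of how often it occurs, an important property for imbalanced tasks like hate speech detection. As a self-contained illustration (not this model's evaluation code), Macro F1 can be computed as:

```python
# Sketch: unweighted (macro) average of per-class F1 scores.
# Pure Python, for illustration only; real evaluations typically
# use sklearn.metrics.f1_score(average="macro").

def macro_f1(y_true, y_pred, labels):
    """Average the F1 score of each class in `labels` with equal weight."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because each class contributes equally, a model that ignores a rare class (e.g. the "hate" label) is penalized even if overall accuracy stays high.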
## Frequently Asked Questions
Q: What makes this model unique?
The model distinguishes itself through pre-training on a diverse 14GB Vietnamese social media corpus and strong performance across social media analysis tasks, consistently outperforming predecessors such as PhoBERT and viBERT.
Q: What are the recommended use cases?
The model is particularly well-suited for Vietnamese social media text analysis, including emotion detection, content moderation, spam detection, and hate speech identification on social media platforms and in e-commerce applications.