WangchanBERTa Base Model
| Property | Value |
|---|---|
| Parameter Count | 106M |
| Training Data Size | 78.5GB |
| Architecture | RoBERTa-based |
| Tensor Type | F32 |
| Language | Thai |
What is wangchanberta-base-att-spm-uncased?
WangchanBERTa is a Thai language model based on the RoBERTa architecture, designed specifically for Thai natural language processing tasks. Trained on a 78.5GB corpus of Thai text, it represents a significant advance in Thai language understanding and processing.
Implementation Details
The model employs a SentencePiece tokenizer with a vocabulary of 25,000 subwords, trained on 15M sentences. It processes sequences of up to 416 subword tokens and implements masked language modeling with a 15% masking rate. Training ran for 500,000 steps on 8 V100 GPUs with a batch size of 4,096. Key preprocessing and corpus details are listed below, followed by a short tokenizer sketch.
- Comprehensive preprocessing pipeline including HTML cleanup and Thai-specific text normalization
- Advanced tokenization using dictionary-based maximal matching
- Specialized space handling: spaces are explicitly marked with the `<_>` token
- Training corpus of 381,034,638 unique Thai sentences
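A quick way to verify these tokenizer details in practice is to load it with Hugging Face transformers. The sketch below assumes the checkpoint is available under the Hub ID `airesearch/wangchanberta-base-att-spm-uncased`; substitute the correct ID or a local path if yours differs.

```python
# Minimal sketch: inspecting the WangchanBERTa tokenizer.
# The Hub ID is an assumption; adjust as needed.
from transformers import AutoTokenizer

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(len(tokenizer))              # vocabulary size (about 25,000 subwords)
print(tokenizer.model_max_length)  # maximum input length in subword tokens

# Spaces in the input should surface as the explicit <_> marker.
print(tokenizer.tokenize("สวัสดี ครับ"))
```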
Core Capabilities
- Masked Language Modeling (see the fill-mask sketch after this list)
- Multiclass Text Classification (sentiment analysis, review ratings)
- Multilabel Text Classification (news categorization)
- Token Classification (Named Entity Recognition, POS tagging)
- Support for fine-tuning on specific Thai language tasks
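To illustrate the masked language modeling capability, the sketch below runs the transformers fill-mask pipeline against the checkpoint; the Hub ID and the example sentence are assumptions, not taken from the model card.

```python
# Minimal sketch of masked-token prediction with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="airesearch/wangchanberta-base-att-spm-uncased",  # assumed Hub ID
)

# Illustrative Thai sentence ("I really like eating <mask>");
# <mask> marks the subword to predict.
for prediction in fill_mask("ผมชอบกิน<mask>มาก"):
    print(prediction["token_str"], f"{prediction['score']:.4f}")
```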
Frequently Asked Questions
Q: What makes this model unique?
The model's specialization in Thai language processing, combined with its extensive training data and careful preprocessing considerations for Thai-specific linguistic features, makes it particularly effective for Thai NLP tasks. Its architecture is optimized for Thai language understanding while maintaining compatibility with modern transformer-based approaches.
Q: What are the recommended use cases?
The model excels in various Thai language processing tasks, including sentiment analysis, review classification, named entity recognition, and topic classification. It's particularly well-suited for tasks requiring deep understanding of Thai text structure and semantics.
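For classification-style use cases such as sentiment analysis, a typical starting point is to load the checkpoint with a sequence classification head and fine-tune it on labeled Thai data. The sketch below shows a minimal setup under assumed names: the Hub ID, the three-label scheme, and the example sentences are all hypothetical.

```python
# Minimal fine-tuning setup sketch for Thai text classification.
# Hub ID, label count, and example texts are hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=3,  # e.g., negative / neutral / positive (hypothetical)
)

# Tokenize a small batch of illustrative Thai reviews, truncating to the
# model's subword limit, then plug into your training loop or Trainer.
batch = tokenizer(
    ["อาหารอร่อยมาก", "บริการแย่"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (batch_size, num_labels)
```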