WangchanBERTa Base Model
| Property | Value |
|---|---|
| Parameter Count | 106M |
| Training Data Size | 78.5GB |
| Architecture | RoBERTa-based |
| Tensor Type | F32 |
| Language | Thai |
What is wangchanberta-base-att-spm-uncased?
WangchanBERTa is a Thai language model based on the RoBERTa architecture, designed specifically for Thai natural language processing tasks. Trained on a 78.5GB corpus of Thai text, it represents a significant advance in Thai language understanding and processing.
Implementation Details
The model employs a SentencePiece tokenizer with a vocabulary of 25,000 subwords, trained on 15M sentences. It processes sequences of up to 416 subword tokens and implements masked language modeling with a 15% masking rate. Training ran for 500,000 steps on 8 V100 GPUs with a batch size of 4,096. Key preprocessing and corpus details are listed below, followed by a short tokenizer sketch.
- Comprehensive preprocessing pipeline including HTML cleanup and Thai-specific text normalization
- Advanced tokenization using dictionary-based maximal matching
- Specialized space handling: spaces are explicitly marked with the `<_>` token
- Training corpus of 381,034,638 unique Thai sentences
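A quick way to verify these tokenizer details in practice is to load it with Hugging Face transformers. The sketch below assumes the checkpoint is available under the Hub ID `airesearch/wangchanberta-base-att-spm-uncased`; substitute the correct ID or a local path if yours differs.

```python
# Minimal sketch: inspecting the WangchanBERTa tokenizer.
# The Hub ID is an assumption; adjust as needed.
from transformers import AutoTokenizer

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(len(tokenizer))              # vocabulary size (about 25,000 subwords)
print(tokenizer.model_max_length)  # maximum input length in subword tokens

# Spaces in the input should surface as the explicit <_> marker.
print(tokenizer.tokenize("สวัสดี ครับ"))
```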
Core Capabilities
- Masked Language Modeling (see the fill-mask sketch after this list)
- Multiclass Text Classification (sentiment analysis, review ratings)
- Multilabel Text Classification (news categorization)
- Token Classification (Named Entity Recognition, POS tagging)
- Support for fine-tuning on specific Thai language tasks
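To illustrate the masked language modeling capability, the sketch below runs the transformers fill-mask pipeline against the checkpoint; the Hub ID and the example sentence are assumptions, not taken from the model card.

```python
# Minimal sketch of masked-token prediction with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="airesearch/wangchanberta-base-att-spm-uncased",  # assumed Hub ID
)

# Illustrative Thai sentence ("I really like eating <mask>");
# <mask> marks the subword to predict.
for prediction in fill_mask("ผมชอบกิน<mask>มาก"):
    print(prediction["token_str"], f"{prediction['score']:.4f}")
```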
Frequently Asked Questions
Q: What makes this model unique?
The model's specialization in Thai language processing, combined with its extensive training data and careful preprocessing considerations for Thai-specific linguistic features, makes it particularly effective for Thai NLP tasks. Its architecture is optimized for Thai language understanding while maintaining compatibility with modern transformer-based approaches.
Q: What are the recommended use cases?
The model excels in various Thai language processing tasks, including sentiment analysis, review classification, named entity recognition, and topic classification. It's particularly well-suited for tasks requiring deep understanding of Thai text structure and semantics.
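For classification-style use cases such as sentiment analysis, a typical starting point is to load the checkpoint with a sequence classification head and fine-tune it on labeled Thai data. The sketch below shows a minimal setup under assumed names: the Hub ID, the three-label scheme, and the example sentences are all hypothetical.

```python
# Minimal fine-tuning setup sketch for Thai text classification.
# Hub ID, label count, and example texts are hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=3,  # e.g., negative / neutral / positive (hypothetical)
)

# Tokenize a small batch of illustrative Thai reviews, truncating to the
# model's subword limit, then plug into your training loop or Trainer.
batch = tokenizer(
    ["อาหารอร่อยมาก", "บริการแย่"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (batch_size, num_labels)
```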