wangchanberta-base-att-spm-uncased

airesearch

WangchanBERTa: a 106M-parameter RoBERTa-based Thai language model trained on 78.5GB of text, intended for masked language modeling and classification tasks.

| Property | Value |
|---|---|
| Parameter Count | 106M |
| Training Data Size | 78.5GB |
| Architecture | RoBERTa-based |
| Tensor Type | F32 |
| Language | Thai |

What is wangchanberta-base-att-spm-uncased?

WangchanBERTa is a Thai language model based on the RoBERTa architecture and built for Thai natural language processing tasks. It was trained on a 78.5GB corpus of Thai text.

Implementation Details

The model uses a SentencePiece tokenizer with a vocabulary of 25,000 subwords, trained on 15M sentences. It processes sequences of up to 416 subword tokens and is trained with a masked language modeling objective at a 15% masking rate. Training ran on 8 V100 GPUs for 500,000 steps with a batch size of 4,096.
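The masking step described above can be illustrated with a minimal sketch. It is a simplification and not the authors' training code: it masks an exact 15% of positions rather than sampling each token independently, it always substitutes a mask token (RoBERTa-style training usually also leaves some selected tokens unchanged or swaps in random ones), and the `<mask>` string, the `mask_tokens` helper, and the seed are assumptions made for illustration.

```python
import random

MASK_TOKEN = "<mask>"   # assumed mask string, for illustration only
MAX_LEN = 416           # maximum sequence length stated in the text
MASK_RATE = 0.15        # masking rate stated in the text


def mask_tokens(tokens, rate=MASK_RATE, seed=0):
    """Truncate to MAX_LEN and replace a fixed fraction of tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels holds the original token at
    each masked position and None everywhere else.
    """
    tokens = list(tokens[:MAX_LEN])
    n_mask = max(1, round(len(tokens) * rate)) if tokens else 0
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    positions = rng.sample(range(len(tokens)), n_mask)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = masked[pos]   # the model is trained to recover this token
        masked[pos] = MASK_TOKEN
    return masked, labels


# Hypothetical input: 500 placeholder subword tokens.
example_tokens = [f"tok{i}" for i in range(500)]
masked, labels = mask_tokens(example_tokens)
```

With a full 416-token sequence, this sketch hides 62 tokens, which is the rounded value of 15% of 416.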

  • Preprocessing pipeline with HTML cleanup and Thai-specific text normalization
  • Word-level tokenization using dictionary-based maximal matching during preprocessing
  • Specialized space handling with explicit marking using <_>
  • Training corpus of 381,034,638 unique Thai sentences
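The preprocessing ideas in the list above can be sketched in a few lines of standard-library Python. This is an illustrative approximation under stated assumptions, not the authors' pipeline: the cleanup rules are reduced to tag stripping, entity unescaping, and whitespace collapsing; the dictionary is a five-word toy set; and `clean_text`, `maximal_match`, and `preprocess` are hypothetical helper names. The matcher picks the segmentation with the fewest tokens, falling back to single characters for text the dictionary does not cover.

```python
import html
import re

SPACE_TOKEN = "<_>"  # explicit space marker mentioned in the text

# Toy dictionary for illustration; a real word list would be far larger.
TOY_DICTIONARY = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย", "มาก"}


def clean_text(text):
    """Strip HTML tags, unescape entities, and collapse whitespace runs."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def maximal_match(text, dictionary):
    """Segment text into the fewest tokens, using dictionary words where possible."""
    n = len(text)
    max_len = max(map(len, dictionary), default=1)
    # best[i] = (token count, start of last token) for the best split of text[:i]
    best = [(0, 0)] + [(n + 1, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            # Single characters are always allowed as a fallback token.
            if text[start:end] in dictionary or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    tokens = []
    end = n
    while end > 0:  # walk the back-pointers to recover the tokens
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


def preprocess(text, dictionary=TOY_DICTIONARY):
    """Clean the text, segment each space-separated chunk, and mark spaces."""
    tokens = []
    for i, chunk in enumerate(clean_text(text).split(" ")):
        if i > 0:
            tokens.append(SPACE_TOKEN)
        tokens.extend(maximal_match(chunk, dictionary))
    return tokens


result = preprocess("<p>ฉันรักภาษาไทย&nbsp; มาก</p>")
```

Because Thai is written without spaces between words, the spaces that do appear carry meaning (for example, as phrase boundaries), which is one plausible reason to preserve them as an explicit token rather than discard them.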

Core Capabilities

  • Masked Language Modeling
  • Multiclass Text Classification (sentiment analysis, review ratings)
  • Multilabel Text Classification (news categorization)
  • Token Classification (Named Entity Recognition, POS tagging)
  • Support for fine-tuning on specific Thai language tasks
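The multiclass and multilabel tasks listed above differ mainly in how a classifier's output scores are turned into labels. The sketch below shows the usual convention (softmax with a single winning class versus an independent sigmoid per label); the logits, label names, and the 0.5 threshold are made-up values for illustration and do not come from this model.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a single logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multiclass(logits, labels):
    """Multiclass: exactly one label, the one with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best]


def decode_multilabel(logits, labels, threshold=0.5):
    """Multilabel: every label whose own probability clears the threshold."""
    return [label for x, label in zip(logits, labels) if sigmoid(x) >= threshold]


# Hypothetical sentiment logits: one review gets one sentiment.
sentiment = decode_multiclass([0.2, 2.1, -1.0], ["negative", "neutral", "positive"])

# Hypothetical news-topic logits: one article may carry several topics.
topics = decode_multilabel([1.5, -0.3, 0.8], ["politics", "sports", "economy"])
```

Token classification tasks such as NER and POS tagging apply the multiclass decoding once per token instead of once per document.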

Frequently Asked Questions

Q: What makes this model unique?

The model is trained specifically on Thai text, using 78.5GB of training data and a preprocessing pipeline that handles Thai-specific features such as text normalization and explicit space marking. It keeps the RoBERTa architecture, so it remains compatible with standard transformer-based fine-tuning approaches.

Q: What are the recommended use cases?

The model is suited to Thai language processing tasks such as sentiment analysis, review classification, named entity recognition, and topic classification. For these tasks it is fine-tuned on labeled Thai data for the target task.
