wangchanberta-base-att-spm-uncased

Maintained By
airesearch

WangchanBERTa Base Model

Property              Value
Parameter Count       106M
Training Data Size    78.5GB
Architecture          RoBERTa-based
Tensor Type           F32
Language              Thai

What is wangchanberta-base-att-spm-uncased?

WangchanBERTa is a Thai language model based on the RoBERTa architecture, designed specifically for Thai natural language processing tasks. Trained on a 78.5GB corpus of Thai text, it represents a significant advancement in Thai language understanding and processing.

Implementation Details

The model employs a SentencePiece tokenizer with a vocabulary size of 25,000 subwords, trained on 15M sentences. It processes sequences up to 416 subword tokens in length and implements masked language modeling with a 15% masking rate. The training process utilized 8 V100 GPUs for 500,000 steps with a batch size of 4,096.
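The masking scheme described above can be sketched in plain Python. This is an illustrative toy (the function name `mask_tokens`, the seed, and the token strings are all made up here), showing only the two constraints the card states: sequences are truncated to 416 subword tokens, and roughly 15% of positions are replaced with a mask token.

```python
import random

MAX_LEN = 416      # maximum sequence length in subword tokens
MASK_RATE = 0.15   # fraction of tokens masked for masked language modeling

def mask_tokens(tokens, mask_token="<mask>", seed=0):
    """Truncate to MAX_LEN and randomly mask ~15% of tokens (illustrative only)."""
    rng = random.Random(seed)
    tokens = tokens[:MAX_LEN]
    n_mask = max(1, round(len(tokens) * MASK_RATE))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)

tokens = [f"tok{i}" for i in range(20)]
masked, positions = mask_tokens(tokens)
print(len(positions))  # 3 masked positions (15% of 20)
```

A production pipeline would operate on token IDs and also apply the standard 80/10/10 mask/random/keep split, which this sketch omits.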

  • Comprehensive preprocessing pipeline including HTML cleanup and Thai-specific text normalization
  • Advanced tokenization using dictionary-based maximal matching
  • Specialized space handling with explicit marking using <_>
  • Training corpus of 381,034,638 unique Thai sentences
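The HTML cleanup and explicit space marking from the list above can be approximated as follows. This is a minimal sketch, not the model's actual preprocessing code: the function name `preprocess` and the exact regex rules are assumptions; only the `<_>` space token comes from the card.

```python
import html
import re

SPACE_TOKEN = "<_>"  # explicit space marker used during tokenization

def preprocess(text):
    """Illustrative cleanup: strip residual HTML tags, unescape entities,
    normalize whitespace, then mark remaining spaces explicitly."""
    text = re.sub(r"<[^>]+>", "", text)        # drop HTML tags
    text = html.unescape(text)                 # decode entities like &amp;
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text.replace(" ", SPACE_TOKEN)      # explicit space marking

print(preprocess("<p>สวัสดี &amp; ยินดีต้อนรับ</p>"))
```

Marking spaces explicitly matters for Thai, where words are normally written without spaces, so a literal space carries information the tokenizer should not silently discard.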

Core Capabilities

  • Masked Language Modeling
  • Multiclass Text Classification (sentiment analysis, review ratings)
  • Multilabel Text Classification (news categorization)
  • Token Classification (Named Entity Recognition, POS tagging)
  • Support for fine-tuning on specific Thai language tasks

Frequently Asked Questions

Q: What makes this model unique?

The model's specialization in Thai language processing, combined with its extensive training data and careful preprocessing considerations for Thai-specific linguistic features, makes it particularly effective for Thai NLP tasks. Its architecture is optimized for Thai language understanding while maintaining compatibility with modern transformer-based approaches.

Q: What are the recommended use cases?

The model excels in various Thai language processing tasks, including sentiment analysis, review classification, named entity recognition, and topic classification. It's particularly well-suited for tasks requiring deep understanding of Thai text structure and semantics.
