IndoBERTweet

Property	Value
Author	indolem
License	Apache 2.0
Language	Indonesian

What is indobertweet-base-uncased?

IndoBERTweet is a pretrained language model for Indonesian Twitter content, and the first large-scale pretrained model for this domain. It extends a monolingually trained Indonesian BERT model with additional domain-specific vocabulary. The model was trained on 409M word tokens from Indonesian tweets collected between December 2019 and December 2020.

Implementation Details

New domain-specific vocabulary entries are initialized by average-pooling BERT subword embeddings: each added token starts from the mean of the embeddings of the subwords it was previously split into. This replaces the more traditional options of projecting word2vec embeddings or training the new embeddings from scratch, and has proven both more efficient and more effective in practice.
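The average-pooling idea can be illustrated with a minimal sketch. This is not the authors' training code: the function name `init_new_token_embeddings`, the toy vocabulary, the toy embedding values, and the fixed segmentation in `toy_tokenize` are all hypothetical stand-ins chosen so the example runs without any model download.

```python
def init_new_token_embeddings(new_tokens, tokenize, vocab, embeddings):
    """Return {new_token: vector}, where each vector is the mean of the
    embeddings of the subwords the existing tokenizer splits the token into."""
    dim = len(embeddings[0])
    initialised = {}
    for token in new_tokens:
        # Keep only subword pieces that exist in the original vocabulary.
        pieces = [p for p in tokenize(token) if p in vocab]
        if not pieces:
            continue  # nothing to pool from; such tokens need another strategy
        rows = [embeddings[vocab[p]] for p in pieces]
        # Average-pool the subword vectors dimension by dimension.
        initialised[token] = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return initialised


# Toy stand-ins (hypothetical): 4 subwords with 2-dimensional embeddings.
toy_vocab = {"wk": 0, "##wk": 1, "mager": 2, "##an": 3}
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.5, 0.5], [1.5, 2.5]]


def toy_tokenize(word):
    # Hypothetical fixed segmentation standing in for a WordPiece tokenizer.
    return {"wkwk": ["wk", "##wk"], "mageran": ["mager", "##an"]}.get(word, [])


new_vectors = init_new_token_embeddings(
    ["wkwk", "mageran"], toy_tokenize, toy_vocab, toy_embeddings
)
```

In a real setting the pooled vectors would be appended as new rows of the model's input embedding matrix before continuing pretraining on the tweet corpus; that step is omitted here.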

Preprocessing includes lowercasing and replacing user mentions with @USER and URLs with HTTPURL
Emoticons are translated into text using the emoji package
Training data covers diverse topics including economy, health, education, and government
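The preprocessing steps listed above could be approximated as follows. This is a sketch under assumptions, not the authors' pipeline: the regular expressions `MENTION_RE` and `URL_RE`, the function name `preprocess_tweet`, and the order of the steps are guesses. The `emoji` package is third-party, so the sketch treats it as optional and uses its `demojize` function only when it is installed.

```python
import re

try:
    import emoji  # third-party package; optional in this sketch
except ImportError:
    emoji = None

# Assumed patterns: the exact rules used for the original corpus are not given.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def preprocess_tweet(text):
    """Lowercase a tweet and normalise URLs, mentions, and emoji."""
    text = text.lower()
    # Replace URLs first so an '@' inside a URL is not treated as a mention.
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    if emoji is not None:
        # Turns emoji characters into text aliases such as ':red_heart:'.
        text = emoji.demojize(text)
    return text


cleaned = preprocess_tweet("Halo @Budi cek https://t.co/AbC123 SEKARANG")
```

Applying the same normalisation at inference time as at pretraining time is generally advisable, since the model has only seen the @USER and HTTPURL placeholders rather than raw handles and links.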

Core Capabilities

Sentiment Analysis (achieving up to 92.7% accuracy on SmSA dataset)
Emotion Detection (79.0% accuracy on EmoT dataset)
Hate Speech Detection (up to 88.8% accuracy)
Named Entity Recognition (up to 88.1% accuracy on formal text)
Superior performance compared to mBERT and standard IndoBERT models

Frequently Asked Questions

Q: What makes this model unique?

Two things distinguish the model: its vocabulary initialization technique, which average-pools existing subword embeddings, and its focus on Indonesian Twitter content, for which it is the first large-scale pretrained model. On the tasks listed under Core Capabilities it outperforms general-purpose models such as mBERT and standard IndoBERT.

Q: What are the recommended use cases?

The model is particularly well-suited for Indonesian social media text analysis, including sentiment analysis, emotion detection, hate speech detection, and named entity recognition in both formal and informal Indonesian text on Twitter.
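For these use cases, a typical starting point is loading the checkpoint with the Hugging Face `transformers` library. The sketch below assumes the checkpoint is published on the Hub as `indolem/indobertweet-base-uncased` (inferred from the author and model name on this page) and that `transformers` and `torch` are installed. The helper names `load_indobertweet` and `embed` are hypothetical.

```python
MODEL_ID = "indolem/indobertweet-base-uncased"  # assumed Hub identifier


def load_indobertweet(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and encoder."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(texts, tokenizer, model):
    """Return one [CLS] vector per input text as a tensor."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    # Position 0 of the last hidden state corresponds to the [CLS] token.
    return output.last_hidden_state[:, 0, :]


if __name__ == "__main__":
    tok, mdl = load_indobertweet()
    vectors = embed(["guweehh udh ga' paham lg sm kmu @USER HTTPURL"], tok, mdl)
    print(vectors.shape)
```

The pretrained checkpoint provides only the encoder. For sentiment, emotion, or hate speech classification, a task head (for example via `AutoModelForSequenceClassification`) would still need to be fine-tuned on labelled data.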
