indobertweet-base-uncased

IndoBERTweet is a BERT model for Indonesian Twitter, trained on 409M word tokens, with a domain-specific vocabulary initialized by average-pooling BERT subword embeddings.

Author: indolem
License: Apache 2.0
Paper: View Paper
Language: Indonesian

What is indobertweet-base-uncased?

IndoBERTweet is a language model designed for Indonesian Twitter content. It is the first large-scale pretrained model for this domain, built by extending a monolingually trained Indonesian BERT model with domain-specific vocabulary. The model was trained on 409M word tokens collected from Indonesian tweets posted between December 2019 and December 2020.

Implementation Details

The model initializes the embeddings of its new domain-specific vocabulary by average-pooling BERT subword embeddings, rather than projecting word2vec vectors or training the embeddings from scratch. This method is reported to be both more efficient and more effective.
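The average-pooling idea can be illustrated with a minimal sketch. It assumes a new vocabulary item has already been split into subword ids by the original tokenizer; the function name `average_pool_embedding` and the toy embedding table are hypothetical and are not taken from the IndoBERTweet code.

```python
from statistics import fmean


def average_pool_embedding(subword_ids, embedding_table):
    """Return the element-wise mean of the embeddings for the given subword ids."""
    if not subword_ids:
        raise ValueError("need at least one subword id")
    vectors = [embedding_table[i] for i in subword_ids]
    # zip(*vectors) groups the values of each dimension across all subwords.
    return [fmean(column) for column in zip(*vectors)]


# Toy embedding table: subword id -> 3-dimensional vector (hypothetical values).
toy_embeddings = {
    0: [1.0, 0.0, 2.0],
    1: [3.0, 4.0, 0.0],
}

# Assume a new domain-specific word is split into subwords 0 and 1.
new_vector = average_pool_embedding([0, 1], toy_embeddings)
```

In a real model the resulting vector would be written into the row of the embedding matrix reserved for the new token before pretraining continues.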

  • Preprocessing lowercases the text and replaces user mentions with @USER and URLs with HTTPURL
  • Emoticons are translated into text using the emoji package
  • Training data covers diverse topics including economy, health, education, and government
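The lowercasing and placeholder steps above can be sketched as follows. The regular expressions and the function name `normalize_tweet` are assumptions made for illustration, not the authors' exact rules, and the emoji-to-text step is left out so that the example needs only the standard library.

```python
import re

# Assumed patterns; the original preprocessing may match mentions and URLs differently.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def normalize_tweet(text):
    """Lowercase a tweet, then replace URLs with HTTPURL and mentions with @USER."""
    text = text.lower()
    # Replace URLs first so an '@' inside a URL is not treated as a mention.
    text = URL_PATTERN.sub("HTTPURL", text)
    text = MENTION_PATTERN.sub("@USER", text)
    return text


example = normalize_tweet("Halo @Budi, cek https://example.com/abc SEKARANG!")
```

Lowercasing happens before substitution so that the placeholders themselves stay uppercase.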

Core Capabilities

  • Sentiment Analysis (up to 92.7% accuracy on the SmSA dataset)
  • Emotion Detection (79.0% accuracy on the EmoT dataset)
  • Hate Speech Detection (up to 88.8% accuracy)
  • Named Entity Recognition (up to 88.1% accuracy on formal text)
  • Reported to outperform mBERT and standard IndoBERT models

Frequently Asked Questions

Q: What makes this model unique?

It is the first large-scale pretrained model focused on Indonesian Twitter content, and it initializes its domain-specific vocabulary by average-pooling BERT subword embeddings. It outperforms general-purpose models such as mBERT and standard IndoBERT across the NLP tasks it was evaluated on.

Q: What are the recommended use cases?

The model is particularly well-suited for Indonesian social media text analysis, including sentiment analysis, emotion detection, hate speech detection, and named entity recognition in both formal and informal Indonesian text on Twitter.
