IndoBERT Base Model (Phase 1)
| Property | Value |
|---|---|
| Parameter Count | 124.5M |
| Training Data | Indo4B (23.43 GB) |
| License | MIT |
| Paper | IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding (Wilie et al., 2020) |
What is indobert-base-p1?
IndoBERT base-p1 is a state-of-the-art language model designed specifically for Indonesian language processing. It is built on the BERT architecture and pre-trained on Indo4B, a 23.43 GB Indonesian corpus. It represents the first phase (p1) of the base-size series in the IndoBERT family.
Implementation Details
The model is pre-trained with a masked language modeling (MLM) objective combined with next sentence prediction (NSP). It uses the standard BERT-base transformer architecture with 124.5M parameters, making it a balanced choice between computational efficiency and performance.
- Built on the PyTorch framework
- Supports feature extraction
- Implements the transformer architecture
- Uses uncased tokenization (input text is lowercased)
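
As a concrete starting point, here is a minimal sketch of loading the checkpoint for feature extraction with the Hugging Face transformers library. It assumes the model is available on the Hub as indobenchmark/indobert-base-p1; the example sentence is arbitrary.

```python
import torch
from transformers import AutoModel, BertTokenizer

# Assumed Hub checkpoint ID for this model card.
MODEL_NAME = "indobenchmark/indobert-base-p1"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Encode an Indonesian sentence ("I like reading books") and run the encoder.
inputs = tokenizer("aku suka membaca buku", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token: (batch, seq_len, 768).
print(outputs.last_hidden_state.shape)
```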
Core Capabilities
- Contextual word embeddings for Indonesian text
- Next sentence prediction for text coherence
- Masked language modeling for bidirectional context understanding (see the fill-mask sketch below)
- Supports both direct inference and task-specific fine-tuning
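
To illustrate the MLM capability, here is a minimal fill-mask sketch, again assuming the indobenchmark/indobert-base-p1 checkpoint; the sentence and the prediction mentioned in the comment are only illustrative.

```python
from transformers import pipeline

# BERT-style models use [MASK] as the mask token.
fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-base-p1")

# "Budi sedang [MASK] di perpustakaan." ("Budi is [MASK] in the library.")
# A plausible top prediction would be a verb such as "membaca" (reading).
for pred in fill_mask("Budi sedang [MASK] di perpustakaan."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```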
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically trained on Indonesian language data, making it highly effective for Indonesian NLP tasks. It's part of a larger family of IndoBERT models, offering different sizes and capabilities for various use cases.
Q: What are the recommended use cases?
The model is well suited to Indonesian language processing tasks, including text classification, named entity recognition, and question answering, as well as general language understanding. It is particularly appropriate for applications requiring a deep understanding of Indonesian linguistic context. A minimal fine-tuning sketch follows.
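
For fine-tuning, the sketch below runs a single training step for binary text classification. The two toy sentences, their labels, and the hyperparameters are hypothetical; a real setup would iterate over a proper labeled dataset.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed Hub checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# A fresh classification head (num_labels=2) is added on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy sentiment examples: "the movie is very good" / "the service was very bad".
texts = ["filmnya bagus sekali", "pelayanannya sangat buruk"]
labels = torch.tensor([1, 0])  # hypothetical labels: 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One optimization step; passing labels makes the model compute the loss.
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss after one step: {outputs.loss.item():.4f}")
```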