bert-base-uncased

google-bert

BERT base uncased (110M params) - Foundational transformer model for English language tasks with masked language modeling, trained on BookCorpus and Wikipedia.

Property          Value
Parameter Count   110M
License           Apache 2.0
Paper             BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018, arXiv:1810.04805)
Training Data     BookCorpus + Wikipedia
Architecture      Transformer-based

What is bert-base-uncased?

BERT base uncased is a foundational transformer model that revolutionized natural language processing. Developed by Google, this 110M-parameter model was pre-trained on BookCorpus and English Wikipedia, with all text lowercased so that "english" and "English" are treated identically. Its bidirectional training was a key innovation, and it combines two pre-training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
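
With the Hugging Face transformers library installed, the MLM objective can be tried directly through the fill-mask pipeline. A minimal sketch (exact scores and rankings may vary across library versions):

```python
from transformers import pipeline

# Load the pre-trained model and its WordPiece tokenizer.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the token hidden behind [MASK] using context
# from both the left and the right of the masked position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```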

Implementation Details

The model employs a pre-training approach in which 15% of input tokens are selected for masking: of those, 80% are replaced by the [MASK] token, 10% by random tokens, and 10% are left unchanged. It processes sequences up to 512 tokens long and uses WordPiece tokenization with a 30,000-token vocabulary.
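
The 80/10/10 rule can be made concrete in code. The sketch below is an illustrative re-implementation of that masking scheme, not the original training code; mask_tokens is a hypothetical helper name:

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(token_ids, mask_prob=0.15):
    """Illustrative BERT-style masking: each selected position is
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged."""
    labels = [-100] * len(token_ids)          # -100 is ignored by the MLM loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in tokenizer.all_special_ids:  # never mask [CLS] / [SEP]
            continue
        if random.random() < mask_prob:
            labels[i] = tok                   # predict the original token here
            roll = random.random()
            if roll < 0.8:
                masked[i] = tokenizer.mask_token_id
            elif roll < 0.9:
                masked[i] = random.randrange(tokenizer.vocab_size)
            # else: leave the token unchanged
    return masked, labels

ids = tokenizer("the quick brown fox jumps over the lazy dog")["input_ids"]
masked_ids, labels = mask_tokens(ids)
print(tokenizer.decode(masked_ids))
```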

  • Training utilized 4 cloud TPUs in Pod configuration
  • Trained for 1 million steps with a batch size of 256
  • Uses the Adam optimizer with a learning rate of 1e-4
  • Implements learning rate warmup followed by linear decay (sketched below)
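
The warmup-plus-linear-decay schedule from the last item is available in current versions of transformers as get_linear_schedule_with_warmup. A sketch only: the 10,000 warmup steps and 1 million total steps mirror the published recipe, but this is not the original TPU training setup:

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Ramp the learning rate up over the first 10,000 steps, then decay
# it linearly to zero over the remainder of training.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000
)
```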

Core Capabilities

  • Masked language modeling for bidirectional context understanding
  • Next sentence prediction for document-level comprehension
  • Feature extraction for downstream tasks (see the sketch after this list)
  • High performance on GLUE benchmark tasks
  • Efficient fine-tuning capabilities
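
For the feature-extraction capability noted above, a common pattern is to run text through the bare encoder and take the final hidden states, often the [CLS] vector, as features. A minimal sketch:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT makes a good sentence encoder.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token; the [CLS] vector (position 0)
# is often used as a pooled sentence-level representation.
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)
cls_embedding = token_embeddings[:, 0]         # shape: (1, 768)
print(cls_embedding.shape)
```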

Frequently Asked Questions

Q: What makes this model unique?

This model's bidirectional training architecture sets it apart, allowing it to understand context from both directions simultaneously, unlike traditional left-to-right language models. Its masked language modeling approach enables deep bidirectional representations.
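
One way to observe this is to mask a word whose only disambiguating clue appears after the mask, which a strictly left-to-right model could not exploit. A sketch; the top prediction is typically "dog", though outputs can vary:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The clue ("barked") appears only *after* the mask, so the prediction
# must draw on right-hand context.
print(unmasker("The [MASK] barked at the mail carrier all morning.")[0]["token_str"])
```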

Q: What are the recommended use cases?

The model excels at tasks that require whole-sentence understanding, including sequence classification, token classification, and question answering. It is most effective when fine-tuned for a specific downstream task, but it is not recommended for text generation.
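
As a starting point for such fine-tuning, a classification head can be attached to the pre-trained encoder. The sketch below runs a single illustrative training step on a toy example; num_labels=2 and the label value are placeholder assumptions for a binary task:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized classification head on top of the
# pre-trained encoder (hence the "weights not used" warning on load).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# One illustrative training step on a toy example.
batch = tokenizer(["a delightful little film"], return_tensors="pt")
labels = torch.tensor([1])  # hypothetical "positive" label

outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(f"loss: {outputs.loss.item():.3f}")
```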

