# BERT Base Japanese v3
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Architecture | BERT Base (12 layers, 768 hidden units, 12 attention heads) |
| Training Data | CC-100 (74.3GB) + Japanese Wikipedia (4.9GB) |
| Vocabulary Size | 32,768 tokens |
## What is bert-base-japanese-v3?
bert-base-japanese-v3 is a Japanese language model based on the BERT architecture, developed by Tohoku NLP. It first segments text into words with MeCab using the Unidic 2.1.2 dictionary and then applies WordPiece subword tokenization, making it particularly effective for Japanese text processing.
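The two-stage tokenization can be inspected directly with the Hugging Face transformers library. The snippet below is a minimal sketch: it assumes the model is published under the `cl-tohoku/bert-base-japanese-v3` identifier and that the `fugashi` and `unidic-lite` packages are installed so the MeCab word tokenizer is available.

```python
# Minimal tokenization sketch. Assumes the Hugging Face model id
# "cl-tohoku/bert-base-japanese-v3" and that fugashi + unidic-lite are installed
# so the MeCab word tokenizer can run.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")

# "I am studying natural language processing at Tohoku University."
text = "東北大学で自然言語処理を勉強しています。"

# MeCab first splits the sentence into words; WordPiece then splits rare words into subwords.
print(tokenizer.tokenize(text))
print(tokenizer.vocab_size)  # 32768
```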
## Implementation Details
The model was trained in two stages: 1M steps on the CC-100 corpus followed by 1M steps on Japanese Wikipedia. Training ran on a Cloud TPU v3-8 instance and used whole word masking for the masked language modeling objective (a usage sketch follows the list below).
- Word-level tokenization using MeCab with the Unidic 2.1.2 dictionary
- Subword tokenization using the WordPiece algorithm
- Training corpus combining CC-100 (392M sentences) and Japanese Wikipedia (34M sentences)
- Whole word masking implementation for better contextual understanding
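As a rough illustration of the masked language modeling objective described above, the standard fill-mask pipeline can predict a masked token in a Japanese sentence. The model identifier and the MeCab dependencies are the same assumptions as in the previous snippet.

```python
# Masked language modeling sketch using the fill-mask pipeline.
# The model id "cl-tohoku/bert-base-japanese-v3" is assumed.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v3")

# "Tokyo is the [MASK] of Japan."
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```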
## Core Capabilities
- Contextual representation of Japanese text for downstream processing and understanding
- Efficient two-stage tokenization handling both word and subword levels
- Robust performance on masked language modeling tasks
- Suitable as a base model for various Japanese NLP applications (see the feature-extraction sketch below)
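One common way to use the encoder for downstream applications is to extract final-layer hidden states as sentence or token features. The sketch below takes the [CLS] vector as a simple sentence representation; the pooling choice and model identifier are illustrative assumptions, not a recommendation from the model authors.

```python
# Feature-extraction sketch: encode a sentence and take the 768-dimensional
# [CLS] vector as a simple sentence representation (an illustrative choice).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-v3"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# "Natural language processing is interesting."
inputs = tokenizer("自然言語処理は面白い。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```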
## Frequently Asked Questions
Q: What makes this model unique?
The model combines MeCab word-level tokenization with the Unidic 2.1.2 dictionary, WordPiece subwords, and whole word masking during pretraining, together with extensive training on both CC-100 and Wikipedia data. This combination makes it particularly effective for Japanese language tasks.
Q: What are the recommended use cases?
The model is well-suited for Japanese natural language processing tasks, including text classification, named entity recognition, and masked language modeling. It's particularly effective for applications requiring deep understanding of Japanese text structure.
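For these use cases, the usual workflow is to fine-tune the pretrained encoder with a task-specific head. The following is a minimal sketch of attaching a binary classification head; the label count, example sentences, and model identifier are illustrative assumptions, and a real setup would add a dataset, optimizer, and training loop (for example via the Trainer API).

```python
# Fine-tuning sketch: a sequence-classification head on top of the pretrained encoder.
# num_labels, the example texts, and the model id are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# "This movie was great." / "This movie was boring."
batch = tokenizer(
    ["この映画は最高だった。", "この映画はつまらなかった。"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

# A single forward pass with labels returns the classification loss used during fine-tuning.
outputs = model(**batch, labels=labels)
print(outputs.loss)
```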