BERT Base Japanese Character-Level V3
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Architecture | BERT Base (12 layers, 768 hidden units, 12 attention heads) |
| Training Data | CC-100 (74.3 GB) + Wikipedia (4.9 GB) |
| Vocabulary Size | 7,027 tokens |
What is bert-base-japanese-char-v3?
bert-base-japanese-char-v3 is a BERT model for Japanese that combines word-level preprocessing with character-level tokenization. Developed by Tohoku NLP, it is pretrained with whole word masking on the Japanese portion of the CC-100 corpus and Japanese Wikipedia.
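A minimal loading sketch, assuming the checkpoint is published on the Hugging Face Hub as cl-tohoku/bert-base-japanese-char-v3 and that the fugashi and unidic-lite packages are installed for the MeCab word tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hub id assumed from the Tohoku NLP naming scheme; adjust if it differs.
name = "cl-tohoku/bert-base-japanese-char-v3"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("東北大学で自然言語処理を学んでいます。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional hidden state per character token (plus [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)
```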
Implementation Details
The model employs a two-step tokenization process: input text is first segmented into words with MeCab using the Unidic 2.1.2 dictionary, and each word is then split into characters (illustrated in the sketch after the list below). Training was conducted in two phases: 1M steps on CC-100 followed by 1M steps on Wikipedia, using TPU v3-8 instances.
- Character-level tokenization with whole word masking
- 12 transformer layers with 768-dimensional hidden states
- 12 attention heads
- Trained on 392M sentences from CC-100 and 34M from Wikipedia
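The effect of the two-step tokenization can be inspected directly from the tokenizer; a short sketch (the Hub id and example sentence are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v3")

text = "自然言語処理を勉強しています。"
# MeCab (Unidic 2.1.2) segments the text into words, then each word is split
# into single characters drawn from the 7,027-token vocabulary.
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])  # the same characters with [CLS]/[SEP] ids added
```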
Core Capabilities
- Contextual, character-level representations of Japanese text for feature extraction or fine-tuning
- Compact 7,027-token vocabulary covering kanji, hiragana, katakana, and Latin script
- Whole word masking during pretraining, so all characters of a masked word are predicted together (see the fill-mask sketch below)
- General-purpose base model for downstream NLP tasks
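For example, the pretrained masked-language-modeling head can be exercised through the fill-mask pipeline; a sketch under the same Hub-id assumption (scores and ranking will vary):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-char-v3")

# Because tokenization is character-level, [MASK] stands for a single character.
for candidate in fill_mask("東北大学で自然言語処理の研究をして[MASK]ます。"):
    print(candidate["token_str"], round(candidate["score"], 3))
```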
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid tokenization approach, combining word-level preprocessing with character-level tokenization, enhanced by whole word masking during training. This makes it particularly effective for Japanese text processing.
Q: What are the recommended use cases?
The model is well suited to Japanese language tasks such as text classification, named entity recognition, and question answering (a classification setup is sketched below). Its character-level vocabulary is particularly helpful for text containing rare kanji or mixed scripts, which word- and subword-level vocabularies tend to fragment into unknown tokens.
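A minimal fine-tuning setup for text classification might look like the following sketch; the three-label head and the example sentences are placeholders, and the training loop itself is omitted:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cl-tohoku/bert-base-japanese-char-v3"
tokenizer = AutoTokenizer.from_pretrained(name)
# num_labels is a placeholder; a fresh classification head is added on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

batch = tokenizer(
    ["この映画はとても面白かった。", "対応が遅くて残念だった。"],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels), untrained until fine-tuned
```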