# bert-base-japanese-char-v2
| Property | Value |
|---|---|
| Architecture | BERT Base (12 layers, 768 hidden, 12 heads) |
| Vocabulary Size | 6,144 tokens |
| Training Data | Japanese Wikipedia (30M sentences, 4.0 GB) |
| License | Creative Commons Attribution-ShareAlike 3.0 |
| Author | Tohoku NLP |
## What is bert-base-japanese-char-v2?
bert-base-japanese-char-v2 is a BERT model pretrained on Japanese text. It combines word-level preprocessing with character-level tokenization, using MeCab with the Unidic 2.1.2 dictionary for the initial word segmentation. The model was pretrained on a Japanese Wikipedia dump from August 2020 (about 30M sentences, 4.0 GB of text).
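The snippet below is a minimal loading sketch using Hugging Face Transformers. The hub id `cl-tohoku/bert-base-japanese-char-v2` and the MeCab dependencies (`fugashi`, `unidic-lite`) are assumptions based on the Tohoku NLP release, not details stated in this card.

```python
# Minimal loading sketch; assumes: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```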
## Implementation Details
The model uses a two-step tokenization process: MeCab with the Unidic 2.1.2 dictionary first segments the input into words, and each word is then split into individual characters. Combined with whole word masking during pretraining, this design helps the model capture word-level structure and semantics while keeping a small character-level vocabulary.
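To make the two-step process concrete, here is an illustrative tokenization sketch. The example sentence is arbitrary, and the exact token output is an assumption that may vary with the installed dictionary version.

```python
# Illustrative sketch: MeCab/Unidic segments words first, then each word
# is split into characters, so the final tokens are single characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v2")
tokens = tokenizer.tokenize("東北大学で自然言語処理を学ぶ")
print(tokens)
# Expected form of the output (illustrative):
# ['東', '北', '大', '学', 'で', '自', '然', '言', '語', '処', '理', 'を', '学', 'ぶ']
```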
- Pretraining ran on Cloud TPUs (v3-8) for approximately 5 days
- Uses 512 tokens per instance and 256 instances per batch
- Completed 1M training steps
- Applies whole word masking to the masked language modeling objective (a conceptual sketch follows this list)
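As referenced above, whole word masking means that when a word is selected for masking, all of its character tokens are masked together rather than independently. The following is a conceptual sketch of that idea, not the authors' training code; the function name and masking rate are illustrative.

```python
# Conceptual sketch of character-level whole word masking: a word is
# either left intact or has every one of its characters replaced by
# [MASK], so the model must reconstruct whole words from context.
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """words: word-level segments (e.g. from MeCab). Returns the flat
    character sequence with whole words replaced by mask tokens."""
    output = []
    for word in words:
        chars = list(word)
        if random.random() < mask_prob:
            # Mask every character of the word, not just one.
            output.extend([mask_token] * len(chars))
        else:
            output.extend(chars)
    return output

print(whole_word_mask(["東北", "大学", "で", "学ぶ"]))
```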
## Core Capabilities
- Japanese text processing at character-level granularity
- Robust handling of Japanese's mixed writing systems (kanji, hiragana, katakana)
- Text representation through 768-dimensional hidden states (see the example after this list)
- Tokenization tailored to Japanese linguistic structure
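As a usage example for the hidden states mentioned above, the sketch below runs a sentence through the encoder and inspects the 768-dimensional output. The hub id is the same assumption as earlier.

```python
# Sketch: extract per-token hidden states for downstream use.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("日本語の文章", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```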
## Frequently Asked Questions
**Q: What makes this model unique?**
Its distinctive feature is the hybrid tokenization approach: word-level segmentation followed by character-level splitting, tailored to Japanese. Whole word masking during pretraining pushes the model to learn word-level semantics even though its vocabulary consists of individual characters.
**Q: What are the recommended use cases?**
The model is well suited to Japanese natural language processing tasks such as text classification, named entity recognition, and other text analysis. Its character-level tokenization makes it robust to text that mixes Japanese writing systems and to rare or complex word forms; a classification sketch follows below.
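For the classification use case, here is a hedged sketch of attaching a classification head to the pretrained encoder. The two-label setup is hypothetical, and fine-tuning details (data, optimizer, training loop) are omitted.

```python
# Sketch: adapt the pretrained encoder for a hypothetical binary
# classification task; the head is randomly initialized and must be
# fine-tuned before use.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-v2",  # assumed hub id
    num_labels=2,  # hypothetical two-label task
)
# From here, fine-tune with the standard Trainer API or a custom loop.
```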