# bert-base-japanese-char-v2
| Property | Value |
|---|---|
| Architecture | BERT Base (12 layers, 768 hidden, 12 heads) |
| Vocabulary Size | 6,144 tokens |
| Training Data | Japanese Wikipedia (30M sentences, 4.0 GB) |
| License | Creative Commons Attribution-ShareAlike 3.0 |
| Author | Tohoku NLP |
## What is bert-base-japanese-char-v2?
bert-base-japanese-char-v2 is a BERT model pretrained on Japanese text. It combines word-level preprocessing with character-level tokenization, using MeCab with the Unidic 2.1.2 dictionary for the initial word segmentation. The model was pretrained on a Japanese Wikipedia dump from August 2020 (about 30M sentences, 4.0 GB of text).
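The snippet below is a minimal loading sketch using Hugging Face Transformers. The hub id `cl-tohoku/bert-base-japanese-char-v2` and the MeCab dependencies (`fugashi`, `unidic-lite`) are assumptions based on the Tohoku NLP release, not details stated in this card.

```python
# Minimal loading sketch; assumes: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```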
## Implementation Details
The model uses a two-step tokenization process: MeCab with the Unidic 2.1.2 dictionary first segments the input into words, and each word is then split into individual characters. Combined with whole word masking during pretraining, this design helps the model capture word-level structure and semantics while keeping a small character-level vocabulary.
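To make the two-step process concrete, here is an illustrative tokenization sketch. The example sentence is arbitrary, and the exact token output is an assumption that may vary with the installed dictionary version.

```python
# Illustrative sketch: MeCab/Unidic segments words first, then each word
# is split into characters, so the final tokens are single characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v2")
tokens = tokenizer.tokenize("東北大学で自然言語処理を学ぶ")
print(tokens)
# Expected form of the output (illustrative):
# ['東', '北', '大', '学', 'で', '自', '然', '言', '語', '処', '理', 'を', '学', 'ぶ']
```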
- Pretraining ran on Cloud TPUs (v3-8) for approximately 5 days
- Uses 512 tokens per instance and 256 instances per batch
- Completed 1M training steps
- Applies whole word masking to the masked language modeling objective (a conceptual sketch follows this list)
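As referenced above, whole word masking means that when a word is selected for masking, all of its character tokens are masked together rather than independently. The following is a conceptual sketch of that idea, not the authors' training code; the function name and masking rate are illustrative.

```python
# Conceptual sketch of character-level whole word masking: a word is
# either left intact or has every one of its characters replaced by
# [MASK], so the model must reconstruct whole words from context.
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """words: word-level segments (e.g. from MeCab). Returns the flat
    character sequence with whole words replaced by mask tokens."""
    output = []
    for word in words:
        chars = list(word)
        if random.random() < mask_prob:
            # Mask every character of the word, not just one.
            output.extend([mask_token] * len(chars))
        else:
            output.extend(chars)
    return output

print(whole_word_mask(["東北", "大学", "で", "学ぶ"]))
```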
## Core Capabilities
- Japanese text processing at character-level granularity
- Robust handling of Japanese's mixed writing systems (kanji, hiragana, katakana)
- Text representation through 768-dimensional hidden states (see the example after this list)
- Tokenization tailored to Japanese linguistic structure
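As a usage example for the hidden states mentioned above, the sketch below runs a sentence through the encoder and inspects the 768-dimensional output. The hub id is the same assumption as earlier.

```python
# Sketch: extract per-token hidden states for downstream use.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("日本語の文章", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```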
## Frequently Asked Questions
**Q: What makes this model unique?**
Its distinctive feature is the hybrid tokenization approach: word-level segmentation followed by character-level splitting, tailored to Japanese. Whole word masking during pretraining pushes the model to learn word-level semantics even though its vocabulary consists of individual characters.
**Q: What are the recommended use cases?**
The model is well suited to Japanese natural language processing tasks such as text classification, named entity recognition, and other text analysis. Its character-level tokenization makes it robust to text that mixes Japanese writing systems and to rare or complex word forms; a classification sketch follows below.
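For the classification use case, here is a hedged sketch of attaching a classification head to the pretrained encoder. The two-label setup is hypothetical, and fine-tuning details (data, optimizer, training loop) are omitted.

```python
# Sketch: adapt the pretrained encoder for a hypothetical binary
# classification task; the head is randomly initialized and must be
# fine-tuned before use.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-v2",  # assumed hub id
    num_labels=2,  # hypothetical two-label task
)
# From here, fine-tune with the standard Trainer API or a custom loop.
```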