BERT Base Japanese Character-Level V3
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Architecture | BERT Base (12 layers, 768 hidden units, 12 attention heads) |
| Training Data | CC-100 (74.3 GB) + Wikipedia (4.9 GB) |
| Vocabulary Size | 7,027 tokens |
What is bert-base-japanese-char-v3?
bert-base-japanese-char-v3 is a BERT model for Japanese that combines word-level preprocessing with character-level tokenization. Developed by Tohoku NLP, it is pretrained with whole word masking on the Japanese portion of the CC-100 corpus and Japanese Wikipedia.
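A minimal loading sketch, assuming the checkpoint is published on the Hugging Face Hub as cl-tohoku/bert-base-japanese-char-v3 and that the fugashi and unidic-lite packages are installed for the MeCab word tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hub id assumed from the Tohoku NLP naming scheme; adjust if it differs.
name = "cl-tohoku/bert-base-japanese-char-v3"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("東北大学で自然言語処理を学んでいます。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional hidden state per character token (plus [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)
```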
Implementation Details
The model employs a two-step tokenization process: input text is first segmented into words with MeCab using the Unidic 2.1.2 dictionary, and each word is then split into characters (illustrated in the sketch after the list below). Training was conducted in two phases: 1M steps on CC-100 followed by 1M steps on Wikipedia, using TPU v3-8 instances.
- Character-level tokenization with whole word masking
- 12 transformer layers with 768-dimensional hidden states
- 12 attention heads
- Trained on 392M sentences from CC-100 and 34M from Wikipedia
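The effect of the two-step tokenization can be inspected directly from the tokenizer; a short sketch (the Hub id and example sentence are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v3")

text = "自然言語処理を勉強しています。"
# MeCab (Unidic 2.1.2) segments the text into words, then each word is split
# into single characters drawn from the 7,027-token vocabulary.
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])  # the same characters with [CLS]/[SEP] ids added
```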
Core Capabilities
- Contextual, character-level representations of Japanese text for feature extraction or fine-tuning
- Compact 7,027-token vocabulary covering kanji, hiragana, katakana, and Latin script
- Whole word masking during pretraining, so all characters of a masked word are predicted together (see the fill-mask sketch below)
- General-purpose base model for downstream NLP tasks
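For example, the pretrained masked-language-modeling head can be exercised through the fill-mask pipeline; a sketch under the same Hub-id assumption (scores and ranking will vary):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-char-v3")

# Because tokenization is character-level, [MASK] stands for a single character.
for candidate in fill_mask("東北大学で自然言語処理の研究をして[MASK]ます。"):
    print(candidate["token_str"], round(candidate["score"], 3))
```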
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid tokenization approach, combining word-level preprocessing with character-level tokenization, enhanced by whole word masking during training. This makes it particularly effective for Japanese text processing.
Q: What are the recommended use cases?
The model is well suited to Japanese language tasks such as text classification, named entity recognition, and question answering (a classification setup is sketched below). Its character-level vocabulary is particularly helpful for text containing rare kanji or mixed scripts, which word- and subword-level vocabularies tend to fragment into unknown tokens.
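A minimal fine-tuning setup for text classification might look like the following sketch; the three-label head and the example sentences are placeholders, and the training loop itself is omitted:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cl-tohoku/bert-base-japanese-char-v3"
tokenizer = AutoTokenizer.from_pretrained(name)
# num_labels is a placeholder; a fresh classification head is added on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

batch = tokenizer(
    ["この映画はとても面白かった。", "対応が遅くて残念だった。"],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels), untrained until fine-tuned
```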