BERT Base Japanese V2
| Property | Value |
|---|---|
| License | CC BY-SA 4.0 |
| Training Data | Japanese Wikipedia (~30M sentences) |
| Architecture | 12 layers, 768-dim hidden states, 12 attention heads |
| Vocabulary Size | 32,768 tokens |
What is bert-base-japanese-v2?
bert-base-japanese-v2 is a BERT model pretrained on Japanese text. Input is first segmented into words with MeCab using the Unidic 2.1.2 dictionary and then split into subwords with WordPiece. The model was trained on a large Japanese Wikipedia corpus with whole word masking, in which all subword tokens belonging to a masked word are masked together.
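For orientation, here is a minimal usage sketch. It assumes the checkpoint is available on the Hugging Face Hub as `cl-tohoku/bert-base-japanese-v2` and that the transformers, fugashi, and unidic-lite packages are installed:

```python
# Minimal usage sketch: load the model and tokenizer and extract hidden states.
# The model id and package requirements above are assumptions of this example.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-tohoku/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "日本語の自然言語処理は面白い。"  # "Japanese NLP is interesting."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768),
# matching the 768-dim hidden states listed above.
print(outputs.last_hidden_state.shape)
```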
Implementation Details
The tokenization pipeline runs MeCab with the Unidic 2.1.2 dictionary for word segmentation, followed by WordPiece subword tokenization (a short sketch follows the list below). Pretraining ran on Cloud TPUs (v3-8) for approximately 5 days, using sequences of 512 tokens, a batch size of 256 sequences, and 1M training steps.
- Utilizes fugashi and unidic-lite packages for tokenization
- Implements whole word masking for improved masked language modeling
- Trained on 4.0 GB of text generated from the Japanese Wikipedia dump of August 31, 2020
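The two-stage pipeline can be inspected directly through the tokenizer. The snippet below is a sketch assuming the same `cl-tohoku/bert-base-japanese-v2` checkpoint and the fugashi / unidic-lite packages listed above:

```python
# Sketch of the two-stage tokenization: MeCab word segmentation (Unidic 2.1.2),
# then WordPiece on each word. Model id assumed as noted above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

text = "自然言語処理を勉強しています。"  # "I am studying natural language processing."

# tokenize() applies MeCab first, then WordPiece; subword continuations
# are marked with the "##" prefix.
print(tokenizer.tokenize(text))

# encode() additionally maps tokens to vocabulary ids and adds the
# standard [CLS] ... [SEP] framing.
print(tokenizer.encode(text))
```

The exact token split depends on the dictionary version installed, but the `##`-prefixed pieces make the WordPiece step applied after MeCab's word boundaries visible.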
Core Capabilities
- Advanced Japanese text processing with word-level understanding
- Masked language modeling with whole word masking (see the sketch after this list)
- Efficient handling of Japanese-specific linguistic features
- Suitable for various downstream NLP tasks in Japanese
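As an illustration of the masked language modeling capability, the following hedged example uses the transformers fill-mask pipeline with the same assumed model id; the predictions returned will depend on the checkpoint and library version:

```python
# Masked language modeling demo via the fill-mask pipeline.
# Model id is an assumption of this example, as noted earlier.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v2")

# "[MASK]" is the tokenizer's mask token; the model ranks candidate words
# for the masked position.
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```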
Frequently Asked Questions
Q: What makes this model unique?
The model combines MeCab word segmentation with the Unidic 2.1.2 dictionary, WordPiece subword tokenization, and whole word masking during pretraining. Because Japanese is written without spaces between words, this dictionary-based segmentation gives the model explicit word boundaries that purely character-based models lack.
Q: What are the recommended use cases?
The model is well suited for Japanese language processing tasks such as text classification, named entity recognition, and masked language modeling, and it is particularly useful when a task depends on word-level semantics and context. Fine-tuning follows the standard BERT recipe, as sketched below.
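As a rough sketch of that fine-tuning workflow, the example below attaches a classification head to the pretrained encoder and runs a single optimization step. The two-example dataset, label count, and hyperparameters are illustrative placeholders, not values from the model card:

```python
# Minimal fine-tuning sketch for Japanese text classification.
# Dataset, num_labels, and hyperparameters are placeholders for illustration.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["この映画は最高だった。", "この映画は退屈だった。"]  # positive / negative examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over the 2 labels
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```

In practice this single step would be wrapped in a proper training loop (or the transformers Trainer) over a labeled dataset, with evaluation on a held-out split.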