BERT Base Japanese V2
| Property | Value |
|---|---|
| License | CC BY-SA 4.0 |
| Training Data | Japanese Wikipedia (~30M sentences) |
| Architecture | 12 layers, 768-dim hidden states, 12 attention heads |
| Vocabulary Size | 32,768 tokens |
What is bert-base-japanese-v2?
bert-base-japanese-v2 is a BERT model pretrained on Japanese text. Input is first segmented into words with MeCab using the Unidic 2.1.2 dictionary and then split into subwords with WordPiece. The model was trained on a large Japanese Wikipedia corpus with whole word masking, in which all subword tokens belonging to a masked word are masked together.
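For orientation, here is a minimal usage sketch. It assumes the checkpoint is available on the Hugging Face Hub as `cl-tohoku/bert-base-japanese-v2` and that the transformers, fugashi, and unidic-lite packages are installed:

```python
# Minimal usage sketch: load the model and tokenizer and extract hidden states.
# The model id and package requirements above are assumptions of this example.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-tohoku/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "日本語の自然言語処理は面白い。"  # "Japanese NLP is interesting."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768),
# matching the 768-dim hidden states listed above.
print(outputs.last_hidden_state.shape)
```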
Implementation Details
The tokenization pipeline runs MeCab with the Unidic 2.1.2 dictionary for word segmentation, followed by WordPiece subword tokenization (a short sketch follows the list below). Pretraining ran on Cloud TPUs (v3-8) for approximately 5 days, using sequences of 512 tokens, a batch size of 256 sequences, and 1M training steps.
- Utilizes fugashi and unidic-lite packages for tokenization
- Implements whole word masking for improved masked language modeling
- Trained on 4.0 GB of text generated from the Japanese Wikipedia dump of August 31, 2020
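The two-stage pipeline can be inspected directly through the tokenizer. The snippet below is a sketch assuming the same `cl-tohoku/bert-base-japanese-v2` checkpoint and the fugashi / unidic-lite packages listed above:

```python
# Sketch of the two-stage tokenization: MeCab word segmentation (Unidic 2.1.2),
# then WordPiece on each word. Model id assumed as noted above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

text = "自然言語処理を勉強しています。"  # "I am studying natural language processing."

# tokenize() applies MeCab first, then WordPiece; subword continuations
# are marked with the "##" prefix.
print(tokenizer.tokenize(text))

# encode() additionally maps tokens to vocabulary ids and adds the
# standard [CLS] ... [SEP] framing.
print(tokenizer.encode(text))
```

The exact token split depends on the dictionary version installed, but the `##`-prefixed pieces make the WordPiece step applied after MeCab's word boundaries visible.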
Core Capabilities
- Advanced Japanese text processing with word-level understanding
- Masked language modeling with whole word masking (see the sketch after this list)
- Efficient handling of Japanese-specific linguistic features
- Suitable for various downstream NLP tasks in Japanese
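As an illustration of the masked language modeling capability, the following hedged example uses the transformers fill-mask pipeline with the same assumed model id; the predictions returned will depend on the checkpoint and library version:

```python
# Masked language modeling demo via the fill-mask pipeline.
# Model id is an assumption of this example, as noted earlier.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v2")

# "[MASK]" is the tokenizer's mask token; the model ranks candidate words
# for the masked position.
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```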
Frequently Asked Questions
Q: What makes this model unique?
The model combines MeCab word segmentation with the Unidic 2.1.2 dictionary, WordPiece subword tokenization, and whole word masking during pretraining. Because Japanese is written without spaces between words, this dictionary-based segmentation gives the model explicit word boundaries that purely character-based models lack.
Q: What are the recommended use cases?
The model is well suited for Japanese language processing tasks such as text classification, named entity recognition, and masked language modeling, and it is particularly useful when a task depends on word-level semantics and context. Fine-tuning follows the standard BERT recipe, as sketched below.
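As a rough sketch of that fine-tuning workflow, the example below attaches a classification head to the pretrained encoder and runs a single optimization step. The two-example dataset, label count, and hyperparameters are illustrative placeholders, not values from the model card:

```python
# Minimal fine-tuning sketch for Japanese text classification.
# Dataset, num_labels, and hyperparameters are placeholders for illustration.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["この映画は最高だった。", "この映画は退屈だった。"]  # positive / negative examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over the 2 labels
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```

In practice this single step would be wrapped in a proper training loop (or the transformers Trainer) over a labeled dataset, with evaluation on a held-out split.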