bert-base-japanese-v2

Maintained By: tohoku-nlp

BERT Base Japanese V2

License: CC-BY-SA 4.0
Training Data: Japanese Wikipedia (approx. 30M sentences)
Architecture: 12 layers, 768-dim hidden states, 12 attention heads
Vocabulary Size: 32,768 tokens

What is bert-base-japanese-v2?

bert-base-japanese-v2 is a BERT model pretrained on Japanese text that combines word-level tokenization based on the Unidic 2.1.2 dictionary with WordPiece subword tokenization. It was trained on a large Japanese Wikipedia corpus using whole word masking.
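A minimal sketch of loading the model for masked language modeling with the Hugging Face transformers library. The repository id tohoku-nlp/bert-base-japanese-v2 is assumed from the maintainer name above (the model was originally published under the cl-tohoku organization), and the tokenizer requires the fugashi and unidic-lite packages.

```python
# Minimal sketch: load the tokenizer and masked-LM model.
# Assumes: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "tohoku-nlp/bert-base-japanese-v2"  # assumed repo id; historically cl-tohoku/bert-base-japanese-v2
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```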

Implementation Details

The model employs a sophisticated tokenization pipeline using MeCab with the Unidic 2.1.2 dictionary, followed by WordPiece subword tokenization. Training was conducted on Cloud TPUs (v3-8) for approximately 5 days, processing 512 tokens per instance and 256 instances per batch over 1M training steps.

  • Uses the fugashi and unidic-lite packages for tokenization (sketched in code after this list)
  • Implements whole word masking for improved masked language modeling
  • Trained on 4.0 GB of Japanese Wikipedia text (dump from August 31, 2020)
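The two-stage pipeline described above can be inspected directly from the tokenizer: MeCab (via fugashi with the unidic-lite dictionary) first segments the text into words, and WordPiece then splits each word into subwords marked with the "##" continuation prefix. A short sketch, assuming the same repository id as above:

```python
# Sketch: inspect word-level + WordPiece tokenization.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese-v2")

text = "東北大学で自然言語処理を研究しています。"
tokens = tokenizer.tokenize(text)           # MeCab word segmentation, then WordPiece subwords
print(tokens)                               # continuation pieces carry the "##" prefix
print(tokenizer.convert_tokens_to_ids(tokens))
```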

Core Capabilities

  • Advanced Japanese text processing with word-level understanding
  • Masked language modeling with whole word masking (see the example after this list)
  • Efficient handling of Japanese-specific linguistic features
  • Suitable for various downstream NLP tasks in Japanese
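As a quick illustration of the masked language modeling capability, here is a fill-mask pipeline sketch; the example sentence is illustrative and the repository id is assumed as above.

```python
# Sketch: predict a masked word with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-base-japanese-v2")
masked = f"東京は日本の{fill_mask.tokenizer.mask_token}です。"  # "Tokyo is the [MASK] of Japan."
for pred in fill_mask(masked):
    print(pred["token_str"], round(pred["score"], 3))
```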

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its combination of word-level tokenization using Unidic 2.1.2 and WordPiece subword tokenization, along with whole word masking during training. This approach provides better handling of Japanese language characteristics compared to character-based models.

Q: What are the recommended use cases?

The model is particularly well-suited for Japanese language processing tasks such as text classification, named entity recognition, and masked language modeling. It's especially effective for applications requiring deep understanding of Japanese word semantics and context.
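For a downstream task such as text classification, the model can be fine-tuned with the standard Trainer API. The sketch below is a hedged outline only: the training and evaluation datasets and the label count are hypothetical placeholders.

```python
# Sketch: fine-tune for binary text classification (datasets are hypothetical placeholders).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "tohoku-nlp/bert-base-japanese-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

args = TrainingArguments(output_dir="bert-ja-classifier", num_train_epochs=3,
                         per_device_train_batch_size=16)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset,   # hypothetical tokenized dataset with "labels"
#                   eval_dataset=eval_dataset)
# trainer.train()
```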
