bert-base-japanese-char

Maintained by: tohoku-nlp

BERT Base Japanese (Character Tokenization)

Property: Value
License: CC-BY-SA-4.0
Training Data: Japanese Wikipedia
Architecture: 12 layers, 768 hidden dimensions, 12 attention heads
Vocabulary Size: 4,000 tokens

What is bert-base-japanese-char?

bert-base-japanese-char is a BERT model for Japanese language processing, developed by Tohoku NLP. It uses a two-step tokenization process, combining word-level segmentation (via the MeCab morphological parser with the IPA dictionary) with character-level splitting, making it particularly effective for Japanese text analysis.

Implementation Details

The model was trained on Japanese Wikipedia data from September 2019, comprising approximately 17M sentences across 2.6GB of text. It uses MeCab morphological parser with the IPA dictionary for initial tokenization, followed by character-level splitting. The training configuration mirrors the original BERT with 512 tokens per instance, 256 instances per batch, and 1M training steps.

  • Character-level tokenization after word-level preprocessing
  • 4000-token vocabulary size
  • Trained on the Japanese Wikipedia corpus (~17M sentences, 2.6GB of text)
  • Utilizes Cloud TPUs for training

Core Capabilities

  • Japanese text analysis and understanding
  • Masked language modeling for Japanese text
  • Support for both word and character level processing
  • Optimized for Japanese language patterns

Frequently Asked Questions

Q: What makes this model unique?

This model's distinctive feature is its hybrid tokenization approach, combining word-level and character-level processing, which is particularly suited for Japanese language's complex writing system. The character-level tokenization helps handle the diverse character types in Japanese (kanji, hiragana, katakana) effectively.

Q: What are the recommended use cases?

The model is ideal for Japanese natural language processing tasks, including text classification, named entity recognition, and masked language modeling. It's particularly effective for applications requiring detailed understanding of Japanese text structure and meaning.
