bert-base-japanese-char-v2

Maintained by tohoku-nlp

Architecture: BERT Base (12 layers, 768 hidden units, 12 attention heads)
Vocabulary Size: 6,144 tokens
Training Data: Japanese Wikipedia (30M sentences, 4.0GB)
License: Creative Commons Attribution-ShareAlike 3.0
Author: Tohoku NLP

What is bert-base-japanese-char-v2?

bert-base-japanese-char-v2 is a BERT model designed specifically for Japanese language processing. It combines word-level preprocessing with character-level tokenization, using the Unidic 2.1.2 dictionary for the initial word segmentation. The model was trained on Japanese Wikipedia data from August 2020.

Implementation Details

The model follows a two-step tokenization process: text is first segmented into words by MeCab with the Unidic 2.1.2 dictionary, and each word is then split into individual characters. Combined with whole word masking during training, this approach improves the model's handling of Japanese text structure and semantics.
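
A minimal sketch of this pipeline using the Hugging Face transformers library. Loading the tokenizer additionally requires the fugashi and unidic-lite packages for MeCab support; cl-tohoku/bert-base-japanese-char-v2 is the model's identifier on the Hugging Face Hub.

  from transformers import AutoTokenizer

  # The tokenizer performs MeCab word segmentation followed by
  # character-level splitting (requires fugashi and unidic-lite).
  tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v2")

  text = "東北大学で自然言語処理を研究しています。"
  print(tokenizer.tokenize(text))
  # Each output token is a single character, e.g. ['東', '北', '大', '学', ...]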

  • Training utilized Cloud TPUs (v3-8) over approximately 5 days
  • Used 512 tokens per instance and 256 instances per batch
  • Completed 1M training steps
  • Incorporates whole word masking for improved masked language modeling (sketched below)
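
Whole word masking means that when a word is selected for masking, every character token belonging to that word (as segmented by MeCab) is masked at once. The following is an illustrative sketch of the idea, not the original pretraining code; the word groupings are hypothetical and the 15% rate follows the standard BERT recipe.

  import random

  # Hypothetical character tokens grouped by the MeCab word they belong to.
  words = [["東", "北"], ["大", "学"], ["で"], ["研", "究"], ["を"], ["行", "う"]]

  random.seed(0)
  masked = []
  for word in words:
      # Select whole words at the standard 15% rate and mask
      # every character of a selected word together.
      if random.random() < 0.15:
          masked.extend(["[MASK]"] * len(word))
      else:
          masked.extend(word)
  print(masked)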

Core Capabilities

  • Advanced Japanese text processing with character-level granularity
  • Robust handling of complex Japanese writing systems
  • Efficient text representation through 768-dimensional hidden states (see the example after this list)
  • Optimized for Japanese linguistic structures through specialized tokenization
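
A short sketch of extracting these hidden states with transformers and PyTorch, under the same package assumptions as above:

  import torch
  from transformers import AutoModel, AutoTokenizer

  name = "cl-tohoku/bert-base-japanese-char-v2"
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModel.from_pretrained(name)

  inputs = tokenizer("日本語のテキストを解析します。", return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  # One 768-dimensional vector per character token (plus [CLS] and [SEP]).
  print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])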

Frequently Asked Questions

Q: What makes this model unique?

This model's distinctive feature is its hybrid tokenization approach, which combines word-level and character-level processing and is optimized specifically for Japanese. The use of whole word masking during training promotes a better semantic understanding of Japanese text structure.

Q: What are the recommended use cases?

The model is particularly well-suited for Japanese natural language processing tasks including text classification, named entity recognition, and text analysis. Its character-level tokenization makes it especially effective for handling Japanese text with various writing systems and complex word structures.
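
As a quick way to try the model, the fill-mask pipeline can be used directly; note that each [MASK] stands for exactly one character in this character-level model. The example sentence is illustrative.

  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-char-v2")

  # Each [MASK] corresponds to a single character.
  for prediction in fill_mask("東北大学で自然言語処理を研究してい[MASK]す。"):
      print(prediction["token_str"], prediction["score"])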
