bert-base-japanese-char-v3

Maintained by: tohoku-nlp

BERT Base Japanese Character-Level V3

Property          Value
License           Apache 2.0
Architecture      BERT Base (12 layers, 768 hidden)
Training Data     CC-100 (74.3GB) + Wikipedia (4.9GB)
Vocabulary Size   7,027 tokens

What is bert-base-japanese-char-v3?

bert-base-japanese-char-v3 is a BERT model for Japanese text, developed by Tohoku NLP. It segments input into words first and then splits each word into characters, so its vocabulary consists of individual characters rather than words or subwords. The model was pretrained on CC-100 (74.3GB) and Wikipedia (4.9GB).

Implementation Details

Tokenization is a two-step process: MeCab with the Unidic 2.1.2 dictionary first segments the text into words, and each word is then split into characters. Training ran in two phases, 1M steps on CC-100 followed by 1M steps on Wikipedia, on TPU v3-8 instances.

  • Character-level tokenization with whole word masking
  • 12 transformer layers with 768-dimensional hidden states
  • 12 attention heads
  • Trained on 392M sentences from CC-100 and 34M from Wikipedia
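The two-step tokenization described above can be sketched in plain Python. This is a minimal illustration, not the model's actual tokenizer: the word list is a hypothetical stand-in for MeCab + Unidic output, and the helper name `split_words_to_chars` is an assumption made for this example. The real tokenizer also maps characters to vocabulary IDs and adds special tokens, which is omitted here.

```python
from typing import List, Tuple


def split_words_to_chars(words: List[str]) -> Tuple[List[str], List[int]]:
    """Split pre-segmented words into characters.

    Returns the character tokens plus, for each token, the index of the
    word it came from, so word boundaries survive the character split.
    """
    tokens: List[str] = []
    word_ids: List[int] = []
    for word_index, word in enumerate(words):
        for char in word:
            tokens.append(char)
            word_ids.append(word_index)
    return tokens, word_ids


# Hypothetical word segmentation, standing in for MeCab + Unidic output.
words = ["東北", "大学", "で", "研究", "する"]
tokens, word_ids = split_words_to_chars(words)

print(tokens)    # one entry per character
print(word_ids)  # which word each character belongs to
```

Keeping the word index next to each character is one way to preserve the word boundaries that whole word masking relies on.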

Core Capabilities

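  • Contextual representations of Japanese text for use in downstream models
  • A compact character-level vocabulary of 7,027 tokens for text that mixes kanji, hiragana, and katakana
  • Whole word masking during pretraining, in which all characters of a word are masked together
  • Fine-tuning for downstream NLP tasks such as text classification, named entity recognition, and question answering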

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its hybrid tokenization: text is first segmented into words and then split into characters. Pretraining uses whole word masking, so all characters belonging to one word are masked together. The model therefore keeps a small character vocabulary while still learning from word boundaries.
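As a rough illustration of whole word masking over character tokens, the sketch below masks every character of a selected word at once. It is a simplification under stated assumptions: the word list is a hypothetical stand-in for MeCab output, `whole_word_mask` and `mask_prob` are names invented for this example, and the standard BERT details (such as sometimes keeping or randomly replacing a selected token) are left out.

```python
import random
from typing import List

MASK = "[MASK]"


def whole_word_mask(words: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Return character tokens where each word is masked entirely or not at all."""
    tokens: List[str] = []
    for word in words:
        if rng.random() < mask_prob:
            # Mask every character of the selected word together.
            tokens.extend([MASK] * len(word))
        else:
            # Leave the word unmasked, one token per character.
            tokens.extend(list(word))
    return tokens


# Hypothetical word segmentation, standing in for MeCab + Unidic output.
words = ["東北", "大学", "で", "研究", "する"]
masked = whole_word_mask(words, mask_prob=0.5, rng=random.Random(0))
print(masked)
```

The masking decision is made once per word instead of once per character, so a two-character word such as 研究 never appears half masked.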

Q: What are the recommended use cases?

The model is suited to Japanese language tasks such as text classification, named entity recognition, and question answering. Because each token is a single character, rare words and unusual character combinations can usually be represented by in-vocabulary characters instead of unknown tokens.
