bert-base-japanese-char-v3

tohoku-nlp

Japanese BERT model using character-level tokenization, trained on CC-100 and Wikipedia data with whole word masking. Features a 12-layer architecture with 768-dimensional hidden states.

Property         Value
License          Apache 2.0
Architecture     BERT Base (12 layers, 768 hidden)
Training Data    CC-100 (74.3 GB) + Wikipedia (4.9 GB)
Vocabulary Size  7,027 tokens

What is bert-base-japanese-char-v3?

bert-base-japanese-char-v3 is a specialized BERT model designed for Japanese language processing that implements character-level tokenization combined with word-level preprocessing. Developed by Tohoku NLP, this model represents a sophisticated approach to Japanese text understanding, trained on an extensive corpus of CC-100 and Wikipedia data.

Implementation Details

The model employs a two-step tokenization process: words are first segmented with MeCab using the Unidic 2.1.2 dictionary, and each word is then split into individual characters. Training was conducted in two phases: 1M steps on CC-100 followed by 1M steps on Wikipedia, on TPU v3-8 instances.
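The two-step tokenization can be sketched in plain Python. This is an illustration only, not the model's actual code: the real pipeline uses MeCab with Unidic 2.1.2 for step one, so the word segmentation below is hard-coded as an assumed MeCab output.

```python
def char_tokenize(words):
    """Step 2: split each pre-segmented word into individual characters."""
    tokens = []
    for word in words:
        tokens.extend(list(word))  # one token per character
    return tokens

# Assumed output of step 1 (MeCab word segmentation) for illustration.
words = ["東北", "大学", "で", "勉強", "する"]
print(char_tokenize(words))
# → ['東', '北', '大', '学', 'で', '勉', '強', 'す', 'る']
```

Note that the word boundaries from step 1 are not thrown away: they are what make whole word masking possible during pretraining.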

  • Character-level tokenization with whole word masking
  • 12 transformer layers with 768-dimensional hidden states
  • 12 attention heads
  • Trained on 392M sentences from CC-100 and 34M from Wikipedia
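Whole word masking over character tokens, mentioned above, means that when a word is selected for masking, every character token belonging to it is masked together. A minimal sketch of that idea, with a hypothetical `whole_word_mask` helper and a simplified selection rule (real BERT pretraining also replaces some selections with random or unchanged tokens):

```python
import random

def whole_word_mask(words, mask_ratio=0.15, rng=None):
    """Mask whole words: if a word is selected, ALL of its character
    tokens become [MASK], never just a subset of them."""
    rng = rng or random.Random(0)
    tokens = []
    for word in words:
        chars = list(word)
        if rng.random() < mask_ratio:
            tokens.extend(["[MASK]"] * len(chars))  # mask the whole word
        else:
            tokens.extend(chars)
    return tokens

# With mask_ratio=1.0 every word is masked, character by character:
print(whole_word_mask(["東北", "大学"], mask_ratio=1.0))
# → ['[MASK]', '[MASK]', '[MASK]', '[MASK]']
```

Masking at word granularity prevents the model from trivially reconstructing a character from its unmasked neighbors within the same word, which is what makes the objective meaningful for a character-level vocabulary.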

Core Capabilities

  • Advanced Japanese text understanding and representation
  • Efficient handling of complex Japanese character systems
  • Robust performance through whole word masking
  • Suitable for various downstream NLP tasks

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its hybrid tokenization approach, combining word-level preprocessing with character-level tokenization, enhanced by whole word masking during training. This makes it particularly effective for Japanese text processing.

Q: What are the recommended use cases?

The model is well-suited for Japanese language tasks including text classification, named entity recognition, and question answering. Its character-level approach makes it particularly effective for handling Japanese text with complex character combinations.
