bert-base-japanese-char-v2

tohoku-nlp

BERT base Japanese model using character-level tokenization with whole word masking, trained on Wikipedia data. Features 12 layers, 768-dim hidden states, 12 attention heads.

Property | Value
Architecture | BERT Base (12 layers, 768 hidden units, 12 attention heads)
Vocabulary Size | 6,144 tokens
Training Data | Japanese Wikipedia (30M sentences, 4.0GB)
License | Creative Commons Attribution-ShareAlike 3.0
Author | Tohoku NLP
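
These figures can be cross-checked against the published configuration. The sketch below loads the config with the transformers library; the Hub ID tohoku-nlp/bert-base-japanese-char-v2 is inferred from the model name and author above.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tohoku-nlp/bert-base-japanese-char-v2")

# BERT Base dimensions and the character-level vocabulary size
# (expected per the table above: 12 layers, 768 hidden, 12 heads, 6,144 tokens).
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
print(config.vocab_size)
```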

What is bert-base-japanese-char-v2?

bert-base-japanese-char-v2 is a BERT model designed specifically for Japanese language processing. It combines character-level tokenization with word-level preprocessing, using MeCab with the Unidic 2.1.2 dictionary for the initial word segmentation. The model was trained on a Japanese Wikipedia corpus from August 2020, comprising roughly 30 million sentences (4.0GB of text).

Implementation Details

The model uses a two-step tokenization process: MeCab with the Unidic 2.1.2 dictionary first performs word-level segmentation, and each word is then split into individual characters. Combined with whole word masking during training, this approach lets the masked language modeling objective operate over whole words even though the vocabulary is character-level, which improves handling of Japanese text structure and semantics.
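
A minimal usage sketch of this two-step behavior is shown below; it assumes the transformers, fugashi, and unidic-lite packages are installed (the latter two back the MeCab/Unidic word segmentation) and that the model is available on the Hugging Face Hub as tohoku-nlp/bert-base-japanese-char-v2.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese-char-v2")

# Words are first segmented with MeCab/Unidic, then each word is split
# into characters, so the resulting tokens are individual characters.
print(tokenizer.tokenize("東北大学で自然言語処理を学ぶ"))
```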

  • Training utilized Cloud TPUs (v3-8) over approximately 5 days
  • Implements 512 tokens per instance with 256 instances per batch
  • Completed 1M training steps
  • Incorporates whole word masking, where all character tokens belonging to a word are masked together (illustrated in the sketch after this list)
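
The sketch below illustrates how whole word masking interacts with character-level tokens: words are selected first, and every character of a selected word is masked at once. The function name, masking rate, and selection logic are simplified assumptions for illustration, not the authors' training code.

```python
import random

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """Character-level whole word masking (illustrative sketch only).

    `words` is the output of the word-level MeCab/Unidic step,
    e.g. ["東北", "大学", "で", "学ぶ"]. Every character of a chosen
    word is masked together, so the model must recover the full word.
    """
    masked = []
    for word in words:
        chars = list(word)  # character-level split of the word
        if random.random() < mask_rate:
            masked.extend([mask_token] * len(chars))  # mask the whole word
        else:
            masked.extend(chars)
    return masked

print(whole_word_mask(["東北", "大学", "で", "自然", "言語", "処理", "を", "学ぶ"]))
```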

Core Capabilities

  • Advanced Japanese text processing with character-level granularity
  • Robust handling of complex Japanese writing systems
  • Efficient text representation through 768-dimensional hidden states (see the encoding sketch after this list)
  • Optimized for Japanese linguistic structures through specialized tokenization
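
As an example of using the encoder for feature extraction, the sketch below reads the 768-dimensional hidden states; it assumes transformers, torch, fugashi, and unidic-lite are installed and uses the tohoku-nlp/bert-base-japanese-char-v2 Hub ID inferred from the model name.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "tohoku-nlp/bert-base-japanese-char-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("日本語の文章をベクトルに変換する", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per character token (plus [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```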

Frequently Asked Questions

Q: What makes this model unique?

This model's distinctive feature is its hybrid tokenization approach, which combines word-level segmentation with character-level tokens and is tailored to Japanese. Whole word masking during training encourages the model to learn word-level semantics even though its vocabulary is character-level.

Q: What are the recommended use cases?

The model is particularly well-suited for Japanese natural language processing tasks including text classification, named entity recognition, and text analysis. Its character-level tokenization makes it especially effective for handling Japanese text with various writing systems and complex word structures.
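
For example, the pretrained masked language modeling head can be exercised directly with the transformers fill-mask pipeline. Because tokenization is character-level, each [MASK] stands for a single character; this is a usage sketch assuming transformers, torch, fugashi, and unidic-lite are installed.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-base-japanese-char-v2")

# The single [MASK] corresponds to one character in this character-level model.
for prediction in fill_mask("東北大学で自然言語処[MASK]を学ぶ"):
    print(prediction["token_str"], prediction["score"])
```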
