bert-base-japanese-whole-word-masking

tohoku-nlp

BERT base model for Japanese, pre-trained with whole word masking on Japanese Wikipedia. It uses a 12-layer architecture, word-level tokenization with the IPA dictionary followed by WordPiece subword tokenization, and a 32,000-token vocabulary.

License: CC-BY-SA-4.0
Training Data: Japanese Wikipedia
Architecture: 12 layers, 768 hidden dimensions, 12 attention heads
Vocabulary Size: 32,000 tokens

What is bert-base-japanese-whole-word-masking?

This is a specialized BERT model designed specifically for Japanese language processing, developed by Tohoku NLP. It implements whole word masking during pre-training, which means entire words are masked together during the masked language modeling task, rather than individual subword tokens. The model combines word-level tokenization using the IPA dictionary with WordPiece subword tokenization for optimal Japanese text processing.
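The difference between token-level and whole word masking can be sketched in a few lines. The snippet below is an illustrative simplification (not the model's actual pre-training code): it groups WordPiece subwords, marked with the `##` continuation prefix, back into words and masks each selected word's subwords together.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words at once: a word is a head token plus any
    following '##'-prefixed continuation subwords (WordPiece convention)."""
    # Group subword indices into words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)  # continuation subword joins the previous word
        else:
            words.append([i])    # new word starts here
    rng = random.Random(seed)
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"  # every subword of the word is masked together
    return masked
```

With per-token masking, `勉強` could be masked while `##する` stays visible, leaking part of the word; whole word masking always hides or reveals the full word.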

Implementation Details

The model is trained on Japanese Wikipedia data from September 2019, comprising approximately 17M sentences across 2.6GB of text. It utilizes a two-step tokenization process: first using MeCab with the IPA dictionary for morphological analysis, followed by WordPiece tokenization.

  • Maximum sequence length: 512 tokens per instance
  • Batch size: 256 instances
  • Training steps: 1M
  • Specialized whole word masking for Japanese text
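The two-step tokenization described above can be sketched as follows. This is a minimal stand-in, not the tokenizer's real implementation: the word-segmentation step (MeCab with the IPA dictionary in the actual model) is reduced to a caller-supplied function, and the second step is a greedy longest-match-first WordPiece split.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation subwords carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if cur is None:
            return [unk]  # no subword matched: emit the unknown token
        pieces.append(cur)
        start = end
    return pieces

def tokenize(text, segment, vocab):
    """Step 1: word segmentation (MeCab + IPA dictionary in the real model,
    any word-splitting function here); step 2: WordPiece on each word."""
    return [p for w in segment(text) for p in wordpiece(w, vocab)]
```

For example, with a toy vocabulary, a segmented word like 勉強する splits into 勉強 + ##する, which is exactly the word/subword boundary that whole word masking respects.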

Core Capabilities

  • Advanced Japanese text understanding and processing
  • Masked language modeling with whole word masking
  • Morphological analysis integration
  • Support for long-form text up to 512 tokens
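Because the model caps input at 512 tokens, longer documents are typically split into overlapping windows. The sketch below shows one common approach; the stride value and the placeholder special-token IDs are illustrative choices, not part of this model's specification.

```python
def chunk_tokens(token_ids, max_len=512, stride=128, cls_id=2, sep_id=3):
    """Split a long token-ID sequence into overlapping windows that fit
    a 512-token BERT input, reserving room for [CLS] and [SEP].
    cls_id and sep_id are placeholders, not the model's actual IDs."""
    body = max_len - 2  # leave two slots for [CLS] and [SEP]
    chunks, start = [], 0
    while True:
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break
        start += body - stride  # consecutive windows overlap by `stride` tokens
    return chunks
```

The overlap lets entities or phrases cut off at one window boundary appear intact in the next window.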

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its specialized approach to Japanese language processing, combining IPA dictionary-based word tokenization with whole word masking during pre-training. This makes it particularly effective for Japanese text understanding tasks.

Q: What are the recommended use cases?

The model is ideal for Japanese language tasks including text classification, named entity recognition, and text completion. It's particularly well-suited for applications requiring deep understanding of Japanese language structure and context.
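For a task like text classification, fine-tuning adds a small head on top of the model's pooled [CLS] output (768-dimensional for this architecture). The numpy sketch below shows only that head; the embedding and weights are random stand-ins for illustration, not values from the actual model.

```python
import numpy as np

def classify(cls_embedding, W, b):
    """Toy classification head: linear layer + softmax over the
    pooled [CLS] vector. W and b would be learned during fine-tuning."""
    logits = cls_embedding @ W + b
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=768)           # stand-in for the model's [CLS] output
W = rng.normal(size=(768, 3))      # 3 hypothetical classes
b = np.zeros(3)
probs = classify(h, W, b)
```

In practice the head and the BERT encoder are trained jointly on labeled Japanese data; the same pattern applies per token for named entity recognition.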
