bert-base-japanese-v3

tohoku-nlp

BERT base Japanese model trained on CC-100 and Wikipedia, featuring word-level tokenization with Unidic 2.1.2 dictionary and whole word masking capability.

  • License: Apache 2.0
  • Architecture: BERT Base (12 layers, 768 hidden units, 12 attention heads)
  • Training Data: CC-100 (74.3 GB) + Wikipedia (4.9 GB)
  • Vocabulary Size: 32,768 tokens

What is bert-base-japanese-v3?

bert-base-japanese-v3 is a specialized Japanese language model based on the BERT architecture, developed by Tohoku NLP. It implements word-level tokenization using the Unidic 2.1.2 dictionary, combined with WordPiece subword tokenization, making it particularly effective for Japanese text processing.
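The two-stage scheme can be sketched in plain Python. This is a toy illustration only: the vocabulary below is made up for the example, and the word segmentation step is hard-coded where the real model uses MeCab with the Unidic 2.1.2 dictionary and a 32,768-token WordPiece vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word.
    Continuation pieces are prefixed with '##', as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no prefix of the remainder is in the vocabulary
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Stage 1 (word level): MeCab + Unidic would produce this segmentation.
words = ["東北", "大学"]
# Stage 2 (subword level): WordPiece over a tiny illustrative vocabulary.
vocab = {"東北", "大", "##学"}
tokens = [p for w in words for p in wordpiece(w, vocab)]
print(tokens)  # ['東北', '大', '##学']
```

Words already in the vocabulary pass through whole, while out-of-vocabulary words are split into known subword pieces.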

Implementation Details

The model was trained in two stages: 1M steps on the CC-100 corpus, followed by 1M steps on Wikipedia data. Training was performed on Cloud TPU v3-8 instances, and the masked language modeling objective uses whole word masking.

  • Word-level tokenization using MeCab with Unidic 2.1.2 dictionary
  • Subword tokenization using WordPiece algorithm
  • Training corpus combining CC-100 (392M sentences) and Wikipedia (34M sentences)
  • Whole word masking implementation for better contextual understanding
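The whole word masking step listed above can be sketched as follows. This is a simplified illustration, not the model's actual training code: it groups WordPiece tokens back into words via the '##' continuation prefix and masks every piece of a selected word together, so the model cannot recover a masked word from its unmasked fragments.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask all subword pieces of each selected word together.
    '##'-prefixed tokens are continuations of the preceding word."""
    rng = rng or random.Random(0)
    # Group token indices into word spans.
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    masked = list(tokens)
    for span in spans:
        if rng.random() < mask_prob:
            for i in span:  # mask every piece of the word, not just one
                masked[i] = mask_token
    return masked

# '大' and '##学' form one word, so they are masked (or kept) as a unit.
print(whole_word_mask(["東北", "大", "##学"], mask_prob=1.0))
# ['[MASK]', '[MASK]', '[MASK]']
```

With per-token masking, '大' could be masked while '##学' stays visible, leaking most of the word; masking at word granularity forces the model to rely on surrounding context instead.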

Core Capabilities

  • Advanced Japanese text processing and understanding
  • Efficient tokenization handling both word and subword levels
  • Robust performance on masked language modeling tasks
  • Suitable for various Japanese NLP applications

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for combining word-level tokenization with the Unidic 2.1.2 dictionary, whole word masking, and extensive training on both CC-100 and Wikipedia, which together make it particularly effective for Japanese language tasks.

Q: What are the recommended use cases?

The model is well-suited for Japanese natural language processing tasks, including text classification, named entity recognition, and masked language modeling. It's particularly effective for applications requiring deep understanding of Japanese text structure.
