bert-base-japanese-v3

Maintained By
tohoku-nlp

BERT Base Japanese v3

Property          Value
License           Apache 2.0
Architecture      BERT Base (12 layers, 768 hidden, 12 heads)
Training Data     CC-100 (74.3GB) + Wikipedia (4.9GB)
Vocabulary Size   32,768 tokens
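
As a rough sanity check on the figures in the table, the parameter count can be estimated from the layer count, hidden size, and vocabulary size. The sketch below assumes the standard BERT Base settings that the table does not state (512 positions, 2 segment types, a 3072-wide feed-forward layer) and ignores the masked language modeling head, so treat the result as an estimate rather than an official number.

```python
# Values taken from the table above.
VOCAB_SIZE = 32_768
HIDDEN = 768
LAYERS = 12

# Assumptions: standard BERT Base defaults, not stated in the table.
MAX_POSITIONS = 512
TYPE_VOCAB = 2
INTERMEDIATE = 4 * HIDDEN  # 3072


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def embedding_params():
    word = VOCAB_SIZE * HIDDEN
    position = MAX_POSITIONS * HIDDEN
    segment = TYPE_VOCAB * HIDDEN
    return word + position + segment + layer_norm(HIDDEN)


def encoder_layer_params():
    # Query, key, value, and output projections. The number of heads
    # only splits these matrices, so it does not change the count.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE)
        + linear(INTERMEDIATE, HIDDEN)
        + layer_norm(HIDDEN)
    )
    return attention + feed_forward


def total_params():
    pooler = linear(HIDDEN, HIDDEN)
    return embedding_params() + LAYERS * encoder_layer_params() + pooler


if __name__ == "__main__":
    print(f"word embeddings: {VOCAB_SIZE * HIDDEN:,}")
    print(f"estimated total: {total_params():,}")
```

Under these assumptions the word embedding matrix alone accounts for about 25M of roughly 111M parameters, which shows how much of the model's size comes from the 32,768-token vocabulary.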

What is bert-base-japanese-v3?

bert-base-japanese-v3 is a Japanese language model based on the BERT architecture, developed by Tohoku NLP. Input text is first segmented into words by MeCab with the Unidic 2.1.2 dictionary, and each word is then split into subwords by WordPiece. Because written Japanese does not put spaces between words, this two-stage tokenization gives the model explicit word boundaries to work with.
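
To make the subword stage concrete, here is a minimal sketch of greedy longest-match WordPiece applied to text that has already been split into words. The word list and the tiny vocabulary are hand-written stand-ins (not real MeCab output and not the model's 32,768-token vocabulary), and the `##` continuation prefix follows the usual BERT convention.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a word-internal piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize(words, vocab):
    """Apply WordPiece to each pre-segmented word and flatten the result."""
    tokens = []
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Hypothetical toy vocabulary and pre-segmented input, for illustration only.
TOY_VOCAB = {"自然", "##言語", "##処理", "を", "研究"}
TOY_WORDS = ["自然言語処理", "を", "研究"]

if __name__ == "__main__":
    print(tokenize(TOY_WORDS, TOY_VOCAB))
```

In the real model the word boundaries come from MeCab, so a subword can never span two dictionary words.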

Implementation Details

The model was pretrained in two stages: 1M steps on the CC-100 corpus, followed by another 1M steps on the Wikipedia corpus. Training ran on Cloud TPUs (v3-8) and used whole word masking for the masked language modeling objective.

  • Word-level tokenization using MeCab with Unidic 2.1.2 dictionary
  • Subword tokenization using WordPiece algorithm
  • Training corpus combining CC-100 (392M sentences) and Wikipedia (34M sentences)
  • Whole word masking, in which all subword tokens of a selected word are masked together
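
The whole word masking idea from the list above can be sketched as follows. This is a simplified illustration, not the training code: it assumes `##` marks word-internal pieces, always substitutes `[MASK]` (the original BERT recipe also sometimes keeps or randomizes the token), and uses a hypothetical `mask_prob` parameter.

```python
import random


def group_word_pieces(tokens):
    """Group token indices so that each group covers one whole word."""
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)  # continuation of the previous word
        else:
            groups.append([index])    # first piece of a new word
    return groups


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: either every piece of a word is masked or none is."""
    rng = random.Random(seed)
    masked = list(tokens)
    for group in group_word_pieces(tokens):
        if rng.random() < mask_prob:
            for index in group:
                masked[index] = mask_token
    return masked


if __name__ == "__main__":
    example = ["自然", "##言語", "##処理", "を", "研究"]
    print(whole_word_mask(example, mask_prob=0.5, seed=1))
```

Masking at the word level prevents the model from recovering a hidden piece simply by looking at the neighboring pieces of the same word.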

Core Capabilities

  • Encoding Japanese text into contextual representations
  • Tokenization at both the word and subword levels
  • Masked token prediction, the objective the model was pretrained on
  • A pretrained starting point for fine-tuning on Japanese NLP tasks

Frequently Asked Questions

Q: What makes this model unique?

The model combines MeCab word-level tokenization (Unidic 2.1.2 dictionary) with whole word masking, and it was pretrained on both the CC-100 corpus and Wikipedia.

Q: What are the recommended use cases?

The model is suited to Japanese natural language processing tasks such as text classification, named entity recognition, and masked language modeling. Masked token prediction works with the pretrained weights as they are; for classification and named entity recognition, the model serves as a pretrained encoder to be fine-tuned on task data.
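
For the masked language modeling use case, a minimal sketch with the Hugging Face `transformers` fill-mask pipeline might look like the following. The Hub ID is assumed from the maintainer and model names on this page, and the helper name `predict_masked` is hypothetical. Loading the Japanese tokenizer typically also requires MeCab bindings such as `fugashi` and a dictionary package such as `unidic-lite`; treat those as assumptions about your environment.

```python
# Assumed Hub ID, built from the maintainer and model names on this page.
MODEL_ID = "tohoku-nlp/bert-base-japanese-v3"


def predict_masked(text, top_k=5):
    """Return (token, score) pairs for a single [MASK] position in `text`."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(text, top_k=top_k)
    return [(result["token_str"], result["score"]) for result in results]
```

Calling `predict_masked("東北大学で[MASK]の研究をしています。")` downloads the model on first use and returns the highest-scoring candidates for the masked position. The input should contain exactly one `[MASK]` token for the return shape shown here.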
