bert-large-japanese

Maintained By
tohoku-nlp

BERT Large Japanese

Property          Value
Architecture      BERT Large (24 layers, 1024 hidden dims)
Training Data     Japanese Wikipedia (30M sentences)
License           CC-BY-SA-4.0
Vocabulary Size   32,768 tokens

What is bert-large-japanese?

bert-large-japanese is a BERT model pretrained specifically for Japanese text. Developed by Tohoku NLP, it implements the BERT Large architecture and uses a Japanese-specific tokenization pipeline that combines MeCab word-level segmentation (with the Unidic dictionary) and WordPiece subword tokenization.
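
A minimal usage sketch with Hugging Face transformers follows; the repo id tohoku-nlp/bert-large-japanese and the fugashi/unidic-lite dependencies for MeCab segmentation are assumptions, not details stated in this card:

```python
# Sketch: loading the tokenizer and model (assumes transformers, fugashi, and
# unidic-lite are installed, and that the repo id below is the published one).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese")
model = AutoModel.from_pretrained("tohoku-nlp/bert-large-japanese")

# The tokenizer first segments into words with MeCab/Unidic, then applies
# WordPiece, so rarer words surface as "##"-prefixed subword pieces.
text = "東北大学で自然言語処理を研究しています。"
print(tokenizer.tokenize(text))

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024)
```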

Implementation Details

The model follows the BERT Large architecture with 24 transformer layers, 1024-dimensional hidden states, and 16 attention heads. It was trained on a roughly 4.0GB Japanese Wikipedia corpus using Cloud TPUs, with whole word masking so that all subword pieces of a masked word are masked together.

  • Dual-stage tokenization using MeCab with the Unidic 2.1.2 dictionary, followed by WordPiece
  • Training configuration: 512 tokens per instance, batch size of 256
  • 1M training steps with whole word masking for the MLM objective
  • Trained on Cloud TPU v3-8 instances provided through the TensorFlow Research Cloud

Core Capabilities

  • Advanced Japanese text understanding and representation
  • Masked language modeling with whole word masking
  • Support for both word-level and subword tokenization
  • Optimized for Japanese linguistic structures
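
As a quick illustration of the masked language modeling capability, here is a sketch using the transformers fill-mask pipeline; the repo id and example sentence are illustrative choices, not taken from the card:

```python
# Sketch: whole-word masked prediction via the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-large-japanese")

# [MASK] stands in for a whole word; the model ranks candidate fillers.
for prediction in fill_mask("東北大学で[MASK]を研究しています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```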

Frequently Asked Questions

Q: What makes this model unique?

This model combines BERT large architecture with Japanese-specific tokenization and whole word masking, making it particularly effective for Japanese language tasks. The dual tokenization approach using Unidic and WordPiece provides better handling of Japanese language structures.
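
For illustration, the two tokenization stages can be inspected separately; the word_tokenizer and subword_tokenizer attributes below are internals of transformers' BertJapaneseTokenizer and may vary across library versions:

```python
# Sketch: separating MeCab/Unidic word segmentation from WordPiece splitting.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese")

text = "自然言語処理は楽しい。"
words = tokenizer.word_tokenizer.tokenize(text)  # stage 1: word-level segmentation
print(words)
for word in words:
    # stage 2: WordPiece subwords within each segmented word
    print(word, tokenizer.subword_tokenizer.tokenize(word))
```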

Q: What are the recommended use cases?

The model is ideal for Japanese natural language processing tasks, including text classification, named entity recognition, and masked language modeling. It's particularly suited for applications requiring deep understanding of Japanese language structures and context.
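
A rough sketch of reusing the checkpoint as a classification backbone is shown below; the label count, example sentences, and the idea of a two-class sentiment task are placeholders, not recommendations from the model card:

```python
# Sketch: attaching an (untrained) classification head to the pretrained encoder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "tohoku-nlp/bert-large-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Stay within the 512-token limit used during pretraining.
batch = tokenizer(
    ["この映画は最高だった。", "二度と見たくない。"],
    truncation=True, max_length=512, padding=True, return_tensors="pt",
)

with torch.no_grad():
    logits = model(**batch).logits  # head is randomly initialized; fine-tune first
print(logits.shape)  # (2, num_labels)
```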
