ModernBERT-Ja-30M
| Property | Value |
|---|---|
| Total Parameters | 37M |
| Parameters (without embeddings) | 10M |
| Hidden Dimension | 256 |
| License | MIT |
| Model Type | Masked Language Model |
| Context Length | 8,192 tokens |
What is ModernBERT-Ja-30M?
ModernBERT-Ja-30M is a Japanese language model developed by SB Intuitions that combines local and global attention mechanisms to process long sequences efficiently. It was trained on a large corpus of 4.39T tokens of Japanese and English text and uses a vocabulary of 102,400 tokens. The model brings modern architectural improvements, such as RoPE (Rotary Position Embedding), to Japanese language processing.
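As a masked language model, it can be tried out directly with the Hugging Face `fill-mask` pipeline. The sketch below is a minimal example; the Hub model ID `sbintuitions/modernbert-ja-30m` is assumed from the model name, and a recent transformers release with ModernBERT support is required.

```python
# Minimal masked-language-modeling sketch; model ID is assumed from the model name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="sbintuitions/modernbert-ja-30m")

# Build the input around the tokenizer's own mask token rather than hard-coding it.
text = f"今日の天気は{fill_mask.tokenizer.mask_token}です。"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```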
Implementation Details
The model was trained in three stages: initial pre-training on 3.51T tokens, followed by two context-extension phases on high-quality data. The architecture has 10 layers with an intermediate dimension of 1,024, alternating global and local attention with one globally-attending layer for every two locally-attending (sliding-window) layers. Key configuration values, which can be read directly from the model configuration as sketched after the list below, include:
- Sliding-window (local) attention with a 128-token window
- Global RoPE theta: 160,000
- Local RoPE theta: 10,000
- GELU activation function
- Unigram language model tokenizer with byte fallback
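A small sketch for inspecting these values from the published configuration follows. The attribute names match the ModernBERT configuration class in recent transformers releases, and the Hub model ID is assumed from the model name; `getattr` with a default guards against naming differences.

```python
# Inspect the architecture values listed above from the model configuration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-30m")  # assumed model ID
for name in (
    "hidden_size",                 # 256
    "num_hidden_layers",           # 10
    "intermediate_size",           # 1,024
    "global_attn_every_n_layers",  # spacing of globally-attending layers
    "local_attention",             # sliding-window size (128)
    "global_rope_theta",           # 160,000
    "local_rope_theta",            # 10,000
    "max_position_embeddings",     # 8,192
    "vocab_size",                  # 102,400
):
    print(f"{name}: {getattr(config, name, 'not present')}")
```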
Core Capabilities
- Masked language modeling with strong performance on short sequences
- Efficient processing of sequences up to 8,192 tokens
- Achieves an average score of 85.67 across 12 evaluation datasets
- Strong results on a range of Japanese tasks, including the JGLUE benchmarks
- Processes raw text directly, with no pre-tokenization required (see the tokenizer sketch below)
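The following sketch illustrates the last two points: raw Japanese text is fed straight to the tokenizer, with no separate word-segmentation step, and truncation is set to the 8,192-token context limit. The Hub model ID is assumed from the model name.

```python
# Tokenize raw Japanese text directly and check the long-context limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")  # assumed model ID

# Repeat a sentence to build a long raw-text input; no pre-tokenizer is applied.
text = "吾輩は猫である。名前はまだ無い。" * 500
encoded = tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```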
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid attention mechanism combining local and global attention, allowing it to handle long sequences efficiently while maintaining strong performance on shorter texts. It's also notable for its extensive training on both Japanese and English data, totaling 4.39T tokens.
Q: What are the recommended use cases?
The model is primarily designed for masked language modeling and for fine-tuning on downstream tasks. It's particularly effective for tasks that require understanding of Japanese text, though it's not recommended for text generation or for token classification tasks such as named entity recognition.
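Below is a minimal fine-tuning sketch for a downstream text-classification task using the Hugging Face Trainer. The Hub model ID is assumed from the model name, the two-example dataset is purely illustrative, and the hyperparameters are not tuned.

```python
# Fine-tuning sketch: sentence classification on a tiny illustrative dataset.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "sbintuitions/modernbert-ja-30m"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative sentiment dataset (1 = positive, 0 = negative).
train_dataset = Dataset.from_dict(
    {"text": ["この映画は最高でした。", "この映画は退屈でした。"], "label": [1, 0]}
).map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modernbert-ja-30m-cls",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
trainer.save_model()
```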