ModernBERT-Ja-30M
| Property | Value |
|---|---|
| Total Parameters | 37M |
| Parameters (without embeddings) | 10M |
| Hidden Dimension | 256 |
| License | MIT |
| Model Type | Masked Language Model |
| Context Length | 8,192 tokens |
What is ModernBERT-Ja-30M?
ModernBERT-Ja-30M is a Japanese language model developed by SB Intuitions that combines local and global attention mechanisms to process long sequences efficiently. It was trained on a large corpus of 4.39T tokens of Japanese and English text and uses a vocabulary of 102,400 tokens. The model brings modern architectural improvements, such as RoPE (Rotary Position Embedding), to Japanese language processing.
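As a masked language model, it can be tried out directly with the Hugging Face `fill-mask` pipeline. The sketch below is a minimal example; the Hub model ID `sbintuitions/modernbert-ja-30m` is assumed from the model name, and a recent transformers release with ModernBERT support is required.

```python
# Minimal masked-language-modeling sketch; model ID is assumed from the model name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="sbintuitions/modernbert-ja-30m")

# Build the input around the tokenizer's own mask token rather than hard-coding it.
text = f"今日の天気は{fill_mask.tokenizer.mask_token}です。"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```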
Implementation Details
The model was trained in three stages: initial pre-training on 3.51T tokens, followed by two context-extension phases on high-quality data. The architecture has 10 layers with an intermediate dimension of 1,024, alternating global and local attention with one globally-attending layer for every two locally-attending (sliding-window) layers. Key configuration values, which can be read directly from the model configuration as sketched after the list below, include:
- Sliding-window (local) attention with a 128-token window
- Global RoPE theta: 160,000
- Local RoPE theta: 10,000
- GELU activation function
- Unigram language model tokenizer with byte fallback
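A small sketch for inspecting these values from the published configuration follows. The attribute names match the ModernBERT configuration class in recent transformers releases, and the Hub model ID is assumed from the model name; `getattr` with a default guards against naming differences.

```python
# Inspect the architecture values listed above from the model configuration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-30m")  # assumed model ID
for name in (
    "hidden_size",                 # 256
    "num_hidden_layers",           # 10
    "intermediate_size",           # 1,024
    "global_attn_every_n_layers",  # spacing of globally-attending layers
    "local_attention",             # sliding-window size (128)
    "global_rope_theta",           # 160,000
    "local_rope_theta",            # 10,000
    "max_position_embeddings",     # 8,192
    "vocab_size",                  # 102,400
):
    print(f"{name}: {getattr(config, name, 'not present')}")
```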
Core Capabilities
- Masked language modeling with strong performance on short sequences
- Efficient processing of sequences up to 8,192 tokens
- Achieves an average score of 85.67 across 12 evaluation datasets
- Strong results on a range of Japanese tasks, including the JGLUE benchmarks
- Processes raw text directly, with no pre-tokenization required (see the tokenizer sketch below)
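The following sketch illustrates the last two points: raw Japanese text is fed straight to the tokenizer, with no separate word-segmentation step, and truncation is set to the 8,192-token context limit. The Hub model ID is assumed from the model name.

```python
# Tokenize raw Japanese text directly and check the long-context limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")  # assumed model ID

# Repeat a sentence to build a long raw-text input; no pre-tokenizer is applied.
text = "吾輩は猫である。名前はまだ無い。" * 500
encoded = tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```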
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid attention mechanism combining local and global attention, allowing it to handle long sequences efficiently while maintaining strong performance on shorter texts. It's also notable for its extensive training on both Japanese and English data, totaling 4.39T tokens.
Q: What are the recommended use cases?
The model is primarily designed for masked language modeling and for fine-tuning on downstream tasks. It's particularly effective for tasks that require understanding of Japanese text, though it's not recommended for text generation or for token classification tasks such as named entity recognition.
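Below is a minimal fine-tuning sketch for a downstream text-classification task using the Hugging Face Trainer. The Hub model ID is assumed from the model name, the two-example dataset is purely illustrative, and the hyperparameters are not tuned.

```python
# Fine-tuning sketch: sentence classification on a tiny illustrative dataset.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "sbintuitions/modernbert-ja-30m"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative sentiment dataset (1 = positive, 0 = negative).
train_dataset = Dataset.from_dict(
    {"text": ["この映画は最高でした。", "この映画は退屈でした。"], "label": [1, 0]}
).map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modernbert-ja-30m-cls",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
trainer.save_model()
```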