ModernBERT-Ja-70M
| Property | Value |
|---|---|
| Parameter Count | 70M (31M without embeddings) |
| Context Length | 8,192 tokens |
| Training Data | 4.39T tokens (Japanese & English) |
| License | MIT |
| Author | SB Intuitions |
What is ModernBERT-Ja-70M?
ModernBERT-Ja-70M is a ModernBERT-based Japanese encoder model that combines local and global attention mechanisms to process long sequences efficiently. Developed by SB Intuitions, it uses a vocabulary of 102,400 tokens and supports input sequences of up to 8,192 tokens.
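As a quick sanity check, the tokenizer can be loaded through Hugging Face Transformers. This is a minimal sketch, assuming the model is published on the Hub as `sbintuitions/modernbert-ja-70m`; the printed values should match the vocabulary size and context length described above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-70m")

print(tokenizer.vocab_size)        # expected: 102400
print(tokenizer.model_max_length)  # expected: 8192

# Raw Japanese text can be passed directly; no separate word-segmentation step is needed.
encoded = tokenizer("今日は良い天気ですね。", return_tensors="pt")
print(encoded["input_ids"].shape)
```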
Implementation Details
The model was trained in three stages: initial pre-training on 3.51T tokens, followed by two context-extension phases on 430B and 450B tokens respectively. The architecture incorporates modern improvements such as RoPE (Rotary Position Embedding) and alternates global and local attention (one global attention layer for every two local layers). Key hyperparameters are listed below, followed by a short configuration sketch.
- Model Dimension: 384
- Intermediate Dimension: 1536
- Number of Layers: 13
- Head Dimension: 64
- Sliding Window Size: 128
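These hyperparameters can also be read back from the model configuration. The sketch below is an illustration only: it assumes a Transformers release with ModernBERT support, the Hub id `sbintuitions/modernbert-ja-70m`, and attribute names as defined in the Transformers ModernBERT config.

```python
from transformers import AutoConfig

# Assumed Hub id; adjust if the model is hosted elsewhere.
config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-70m")

print(config.hidden_size)                # model dimension, expected 384
print(config.intermediate_size)          # expected 1536
print(config.num_hidden_layers)          # expected 13
print(config.local_attention)            # sliding window size, expected 128
print(config.global_attn_every_n_layers) # expected 3 (1 global + 2 local layers)
print(config.max_position_embeddings)    # expected 8192
```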
Core Capabilities
- Strong masked language modeling performance
- Strong results on downstream tasks, including the JGLUE benchmark
- Handles both Japanese and English text
- FlashAttention 2 compatibility for faster inference and training
- Accepts raw sentences directly, with no word pre-tokenization step (see the fill-mask sketch below)
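The sketch below illustrates masked language modeling with the `fill-mask` pipeline. The Hub id is an assumption based on this card, and the FlashAttention 2 line is optional and only applies when the `flash-attn` package and a supported GPU are available.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "sbintuitions/modernbert-ja-70m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # optional; requires flash-attn and a supported GPU
)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Raw Japanese text goes in as-is; tokenizer.mask_token marks the position to predict.
text = f"おはようございます、今日の天気は{tokenizer.mask_token}です。"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```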
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid attention mechanism, which combines local sliding-window attention with periodic global attention layers, allowing it to process long sequences efficiently while maintaining strong performance on a range of NLP tasks. It is also trained on a 4.39T-token corpus and incorporates modern architectural improvements such as RoPE.
Q: What are the recommended use cases?
The model excels at masked language modeling and is primarily designed for fine-tuning on downstream tasks. It performs particularly well on tasks such as Japanese linguistic acceptability, natural language inference, and semantic textual similarity. As an encoder-only model, it is not suited to text generation.
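As an illustration of the fine-tuning workflow, the following sketch attaches a classification head and runs a single epoch on a tiny made-up dataset. The Hub id, the two-example dataset, and the hyperparameters are placeholders for illustration only, not the setup behind the reported downstream results; a real run would use a corpus such as a JGLUE task.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "sbintuitions/modernbert-ja-70m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset (binary sentiment); replace with a real task.
train_data = Dataset.from_dict({
    "text": ["この映画は素晴らしかった。", "退屈でつまらない映画だった。"],
    "label": [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-ja-70m-classifier", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
)
trainer.train()
```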