ModernBERT-Ja-70M
| Property | Value |
|---|---|
| Parameter Count | 70M (31M without embeddings) |
| Context Length | 8,192 tokens |
| Training Data | 4.39T tokens (Japanese & English) |
| License | MIT |
| Author | SB Intuitions |
What is ModernBERT-Ja-70M?
ModernBERT-Ja-70M is a ModernBERT-based Japanese encoder model that combines local and global attention mechanisms to process long sequences efficiently. Developed by SB Intuitions, it uses a vocabulary of 102,400 tokens and supports input sequences of up to 8,192 tokens.
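As a quick sanity check, the tokenizer can be loaded through Hugging Face Transformers. This is a minimal sketch, assuming the model is published on the Hub as `sbintuitions/modernbert-ja-70m`; the printed values should match the vocabulary size and context length described above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-70m")

print(tokenizer.vocab_size)        # expected: 102400
print(tokenizer.model_max_length)  # expected: 8192

# Raw Japanese text can be passed directly; no separate word-segmentation step is needed.
encoded = tokenizer("今日は良い天気ですね。", return_tensors="pt")
print(encoded["input_ids"].shape)
```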
Implementation Details
The model was trained in three stages: initial pre-training on 3.51T tokens, followed by two context-extension phases on 430B and 450B tokens respectively. The architecture incorporates modern improvements such as RoPE (Rotary Position Embedding) and alternates global and local attention (one global attention layer for every two local layers). Key hyperparameters are listed below, followed by a short configuration sketch.
- Model Dimension: 384
- Intermediate Dimension: 1536
- Number of Layers: 13
- Head Dimension: 64
- Sliding Window Size: 128
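These hyperparameters can also be read back from the model configuration. The sketch below is an illustration only: it assumes a Transformers release with ModernBERT support, the Hub id `sbintuitions/modernbert-ja-70m`, and attribute names as defined in the Transformers ModernBERT config.

```python
from transformers import AutoConfig

# Assumed Hub id; adjust if the model is hosted elsewhere.
config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-70m")

print(config.hidden_size)                # model dimension, expected 384
print(config.intermediate_size)          # expected 1536
print(config.num_hidden_layers)          # expected 13
print(config.local_attention)            # sliding window size, expected 128
print(config.global_attn_every_n_layers) # expected 3 (1 global + 2 local layers)
print(config.max_position_embeddings)    # expected 8192
```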
Core Capabilities
- Strong masked language modeling performance
- Strong results on downstream tasks, including the JGLUE benchmark
- Handles both Japanese and English text
- FlashAttention 2 compatibility for faster inference and training
- Accepts raw sentences directly, with no word pre-tokenization step (see the fill-mask sketch below)
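The sketch below illustrates masked language modeling with the `fill-mask` pipeline. The Hub id is an assumption based on this card, and the FlashAttention 2 line is optional and only applies when the `flash-attn` package and a supported GPU are available.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "sbintuitions/modernbert-ja-70m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # optional; requires flash-attn and a supported GPU
)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Raw Japanese text goes in as-is; tokenizer.mask_token marks the position to predict.
text = f"おはようございます、今日の天気は{tokenizer.mask_token}です。"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```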
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its hybrid attention mechanism, which combines local sliding-window attention with periodic global attention layers, allowing it to process long sequences efficiently while maintaining strong performance on a range of NLP tasks. It is also trained on a 4.39T-token corpus and incorporates modern architectural improvements such as RoPE.
Q: What are the recommended use cases?
The model excels at masked language modeling and is primarily designed for fine-tuning on downstream tasks. It performs particularly well on tasks such as Japanese linguistic acceptability, natural language inference, and semantic textual similarity. As an encoder-only model, it is not suited to text generation.
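As an illustration of the fine-tuning workflow, the following sketch attaches a classification head and runs a single epoch on a tiny made-up dataset. The Hub id, the two-example dataset, and the hyperparameters are placeholders for illustration only, not the setup behind the reported downstream results; a real run would use a corpus such as a JGLUE task.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "sbintuitions/modernbert-ja-70m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset (binary sentiment); replace with a real task.
train_data = Dataset.from_dict({
    "text": ["この映画は素晴らしかった。", "退屈でつまらない映画だった。"],
    "label": [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-ja-70m-classifier", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
)
trainer.train()
```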