t5-base-japanese
| Property | Value |
|---|---|
| Parameters | 222M |
| License | CC BY-SA 4.0 |
| Training Data | Wikipedia, OSCAR, CC-100 |
| Framework | PyTorch |
What is t5-base-japanese?
t5-base-japanese is a Text-to-Text Transfer Transformer (T5) model pre-trained specifically for Japanese language tasks. Developed by sonoisa, the model is trained on approximately 100 GB of Japanese text drawn from Wikipedia, the OSCAR corpus, and the CC-100 dataset. It outperforms multilingual alternatives such as Google's mT5, particularly on tasks like news classification.
Implementation Details
The model uses a SentencePiece tokenizer trained on the complete Japanese Wikipedia dataset. With 222M parameters, it is about 25% smaller than Google's mT5-small while achieving better performance. The model requires fine-tuning for specific downstream tasks but provides a strong baseline; a minimal loading sketch follows the list below.
- Pre-trained on 100GB of Japanese text
- Achieves 97% accuracy on livedoor news classification
- JSQuAD performance: EM=0.900, F1=0.945
- Implements the T5 architecture with a Japanese-specific SentencePiece vocabulary
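For orientation, here is a minimal loading sketch using the Hugging Face Transformers library. The Hub checkpoint ID `sonoisa/t5-base-japanese` is an assumption based on the model and author names, and because the raw pre-trained checkpoint is not task-specific, the generated output is illustrative only:

```python
# Minimal loading sketch (assumes the Hugging Face Transformers and sentencepiece
# packages, and that the checkpoint is published as "sonoisa/t5-base-japanese").
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_NAME = "sonoisa/t5-base-japanese"  # assumed Hub ID for this model

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# The pre-trained checkpoint still needs task-specific fine-tuning; outputs here
# are only meaningful after that, but the call pattern stays the same.
inputs = tokenizer("こんにちは、世界", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```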
Core Capabilities
- Text classification with high accuracy (97% on news classification; see the text-to-text framing sketch after this list)
- Question answering (JSQuAD benchmark)
- Text generation and sequence-to-sequence tasks
- Feature extraction for Japanese text
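Because T5 casts every task as text-to-text, classification is framed as generating a label string from a prefixed input rather than predicting a class index. The sketch below illustrates that framing; the `classification:` prefix, the field layout, and the category labels are illustrative assumptions, not the exact format behind the benchmark numbers above:

```python
# Illustrative text-to-text framing for Japanese news classification.
# The task prefix and label strings are assumptions for illustration only.
def make_classification_example(title: str, body: str, label: str) -> dict:
    """Convert one news article into an input/target text pair for T5."""
    return {
        "input_text": f"classification: {title} {body}",
        "target_text": label,  # the category name itself, e.g. "スポーツ" or "IT"
    }

example = make_classification_example(
    title="新型スマートフォン発表",
    body="各社が最新モデルを公開した。",
    label="IT",
)
print(example["input_text"])
print(example["target_text"])
```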
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized Japanese language capabilities and improved efficiency, offering better performance than multilingual alternatives with a smaller parameter count. It's particularly notable for achieving 6 percentage points higher accuracy than mT5 on news classification tasks.
Q: What are the recommended use cases?
The model is well-suited for Japanese text classification, question answering, and sequence-to-sequence tasks. However, it requires task-specific fine-tuning before deployment; a minimal fine-tuning sketch follows below. Users should also be aware of potential biases in the training data and ensure ethical usage.
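As a rough orientation for that fine-tuning step, the sketch below trains the checkpoint on a toy text-to-text dataset with PyTorch and Hugging Face Transformers. The Hub ID, task prefix, hyperparameters, and the two in-memory examples are all assumptions; a real run would use a full dataset (e.g. livedoor news for classification), validation, and checkpointing:

```python
# Minimal fine-tuning sketch (assumptions: PyTorch + Hugging Face Transformers,
# the assumed Hub ID "sonoisa/t5-base-japanese", and a toy in-memory dataset).
import torch
from torch.utils.data import DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_NAME = "sonoisa/t5-base-japanese"  # assumed checkpoint ID
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Toy (input, target) text pairs; replace with real task data.
pairs = [
    ("classification: 新型スマートフォンが発表された", "IT"),
    ("classification: 代表チームが決勝に進出した", "スポーツ"),
]

def collate(batch):
    sources = [s for s, _ in batch]
    targets = [t for _, t in batch]
    enc = tokenizer(sources, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    labels = tokenizer(targets, padding=True, truncation=True,
                       max_length=8, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # seq2seq cross-entropy against the target text
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```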