t5-base-japanese

sonoisa

A Japanese T5 model pre-trained on roughly 100GB of text (Wikipedia, OSCAR, CC-100). With 222M parameters, it outperforms mT5 on news classification, reaching 97% accuracy.

Property        Value
Parameters      222M
License         CC BY-SA 4.0
Training Data   Wikipedia, OSCAR, CC-100
Framework       PyTorch

What is t5-base-japanese?

t5-base-japanese is a Text-to-Text Transfer Transformer (T5) model pre-trained specifically for Japanese language tasks. Developed by sonoisa, it was trained on approximately 100GB of Japanese text drawn from Wikipedia, the OSCAR corpus, and the CC-100 dataset. The model outperforms multilingual alternatives, particularly on tasks like news classification.

Implementation Details

The model uses a SentencePiece tokenizer trained on the full Japanese Wikipedia. With 222M parameters, it is roughly 25% smaller than Google's mT5-small while achieving better performance. The pre-trained checkpoint requires fine-tuning for specific downstream tasks but provides a strong baseline (see the loading sketch after the list below).

  • Pre-trained on 100GB of Japanese text
  • Achieves 97% accuracy on livedoor news classification
  • JSQuAD performance: EM=0.900, F1=0.945
  • Implements T5 architecture with Japanese-specific optimizations
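As a concrete starting point, the model can be loaded with Hugging Face Transformers. This is a minimal sketch, assuming the hub ID sonoisa/t5-base-japanese and an installed sentencepiece package; it loads the raw pre-trained checkpoint, which still needs task-specific fine-tuning.

```python
# Minimal loading sketch; assumes the hub ID "sonoisa/t5-base-japanese"
# and that the sentencepiece package is installed.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("sonoisa/t5-base-japanese")
model = T5ForConditionalGeneration.from_pretrained("sonoisa/t5-base-japanese")

# The checkpoint is a denoising language model, not a task-ready system:
# fine-tune it on a downstream task before deployment.
print(model.num_parameters())  # roughly 222M parameters
```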

Core Capabilities

  • Text classification with high accuracy (97% on news classification; see the inference sketch after this list)
  • Question answering (JSQuAD benchmark)
  • Text generation and sequence-to-sequence tasks
  • Feature extraction for Japanese text
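Because T5 casts every task as text-to-text, a fine-tuned checkpoint performs classification by generating the label string itself. The sketch below is hypothetical: the "classification: " prefix, the example text, and the expected label are illustrative assumptions, not part of the released checkpoint.

```python
# Hypothetical inference after fine-tuning: the model generates the class
# label as text. The prefix and label set are illustrative assumptions.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("sonoisa/t5-base-japanese")
model = T5ForConditionalGeneration.from_pretrained("sonoisa/t5-base-japanese")
model.eval()

# "A new smartphone was announced in Tokyo."
text = "classification: 東京で新しいスマートフォンが発表された。"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# After fine-tuning on livedoor news, this would print a category name.
```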

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized Japanese language capabilities and improved efficiency, offering better performance than multilingual alternatives with a smaller parameter count. It's particularly notable for achieving 6 percentage points higher accuracy than mT5 on news classification tasks.

Q: What are the recommended use cases?

The model is well-suited for Japanese text classification, question answering, and sequence-to-sequence tasks. However, it requires task-specific fine-tuning before deployment. Users should be aware of potential biases in the training data and ensure ethical usage.
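A minimal sketch of what that fine-tuning step can look like, treating classification as sequence-to-sequence generation with the label as target text. The toy data, the "classification: " prefix, and the hyperparameters are illustrative assumptions, not the author's published training recipe.

```python
# Minimal fine-tuning sketch: classification as seq2seq, label as target
# text. Data, prefix, and hyperparameters are illustrative assumptions.
import torch
from torch.optim import AdamW
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("sonoisa/t5-base-japanese")
model = T5ForConditionalGeneration.from_pretrained("sonoisa/t5-base-japanese")
optimizer = AdamW(model.parameters(), lr=3e-4)

# Toy examples standing in for a real corpus such as livedoor news.
pairs = [
    ("classification: 新型スマートフォンのレビュー記事。", "it-life-hack"),
    ("classification: 昨夜のサッカー試合の速報。", "sports-watch"),
]

model.train()
for text, label in pairs:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    target = tokenizer(label, return_tensors="pt").input_ids
    loss = model(**enc, labels=target).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```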
