pko-t5-base
| Property | Value |
|---|---|
| Parameter Count | 250M |
| Model Type | T5 v1.1 Variant |
| License | MIT |
| Author | PAUST |
| Model URL | https://huggingface.co/paust/pko-t5-base |
What is pko-t5-base?
pko-t5-base is a Korean-specific adaptation of the T5 v1.1 architecture, pre-trained on Korean data including Namuwiki, Wikipedia, and the Modern Korean Corpus. Unlike standard T5 models, it uses BBPE (byte-level BPE) tokenization instead of SentencePiece, which eliminates out-of-vocabulary (OOV) issues in Korean text processing.
Implementation Details
The model is pre-trained with T5's unsupervised span corruption objective on exclusively Korean data. It is implemented with the Hugging Face Transformers library and should be loaded with the T5TokenizerFast tokenizer for best results; a minimal loading sketch follows the list below.
- Architecture: T5 v1.1 with BBPE tokenization
- Training Data: Korean-specific datasets (Namuwiki, Wikipedia, Modern Korean Corpus)
- Training Method: Unsupervised learning with span corruption
- Model Size: 250M parameters (base version)
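For concreteness, here is a minimal loading sketch using the Hugging Face Transformers API. The sentinel-masked prompt mirrors the span corruption objective and is an illustrative assumption, not an example taken from the model card.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Load the Korean-specific checkpoint; the model card recommends
# T5TokenizerFast rather than the SentencePiece-based T5Tokenizer.
tokenizer = T5TokenizerFast.from_pretrained("paust/pko-t5-base")
model = T5ForConditionalGeneration.from_pretrained("paust/pko-t5-base")

# Illustrative sentinel-masked input, mimicking the span corruption
# format used during pre-training (the sentence is a made-up example).
text = "한국어 자연어 처리는 <extra_id_0> 분야이다."
inputs = tokenizer(text, return_tensors="pt")

# Generate a short completion for the masked span.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

Because the pre-trained checkpoint is intended as a starting point for fine-tuning, raw generations like this serve only as a sanity check that the model and tokenizer load correctly.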
Core Capabilities
- Strong performance on KLUE benchmark tasks after task-specific fine-tuning
- 87.29% F1 on YNAT (topic classification)
- 97.28% LAS on dependency parsing
- 61.53% EM on machine reading comprehension (MRC)
- Supports both single-task and multi-task fine-tuning
Frequently Asked Questions
Q: What makes this model unique?
The model's use of BBPE tokenization instead of SentencePiece makes it particularly effective for Korean text processing, eliminating the OOV issues common in Korean subword vocabularies. It is also optimized specifically for Korean through extensive pre-training on Korean-only datasets.
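A quick way to see the effect of byte-level tokenization is to tokenize arbitrary Korean text and check for unknown tokens. The sentence below is a made-up example, and the check assumes the byte-level vocabulary covers all input bytes.

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("paust/pko-t5-base")

# Byte-level BPE can always fall back to raw bytes, so even rare or
# unusual Korean strings should not map to an unknown token.
text = "긴꼬리딱새가 담장 위를 사뿐히 지나갔다."
ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.unk_token_id in ids)  # expected: False (no OOV tokens)
```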
Q: What are the recommended use cases?
The model is designed to be fine-tuned for specific Korean language tasks. After task-specific fine-tuning, it performs particularly well on KLUE benchmark tasks including text classification, named entity recognition, semantic textual similarity, and machine reading comprehension.
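Since the checkpoint is meant to be fine-tuned, the sketch below shows one supervised training step for a text-to-text formulation of a classification task. The "ynat:" prefix, the example pair, and the hyperparameters are hypothetical choices for illustration, not settings from the model card.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("paust/pko-t5-base")
model = T5ForConditionalGeneration.from_pretrained("paust/pko-t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical YNAT-style pair: topic classification cast as text-to-text.
source = "ynat: 삼성전자, 새 반도체 공장 착공"  # illustrative input
target = "경제"                                  # illustrative label text

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One gradient step on a single example; a real run would batch a
# KLUE-style dataset, mask padding in the labels, and evaluate.
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```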