KR-BERT-char16424
| Property | Value |
|---|---|
| Parameters | 99,265,066 |
| Vocabulary Size | 16,424 |
| Paper | KR-BERT: A Small-Scale Korean-Specific Language Model |
| Training Data | 2.47GB (20M sentences, 233M words) |
| MLM Accuracy | 0.779 |
What is KR-BERT-char16424?
KR-BERT-char16424 is a compact Korean-specific language model developed by Seoul National University's Computational Linguistics Lab. It is designed to be smaller and more efficient than multilingual BERT while maintaining high performance on Korean language tasks. This variant uses a character-based vocabulary with BidirectionalWordPiece tokenization, tailored to the structure of Korean.
Implementation Details
The model implements a BidirectionalWordPiece tokenization strategy that applies BPE-style segmentation in both the forward and backward directions and keeps whichever segmentation has the higher corpus frequency. This approach handles Korean-specific linguistic features well, achieving a masked language modeling accuracy of 0.779, compared with 0.750 for KoBERT.
- Supports both character and sub-character tokenization modes
- Uses a compact vocabulary of 16,424 tokens
- Trained on 2.47GB of Korean text data
- Compatible with both PyTorch and TensorFlow frameworks
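The bidirectional idea described above can be sketched in a few lines: segment a word with greedy longest-match both left-to-right and right-to-left, then keep whichever segmentation is more frequent overall. This is an illustrative toy, not KR-BERT's actual tokenizer; the vocabulary and frequency counts below are invented for the example.

```python
# Toy sketch of BidirectionalWordPiece-style tokenization: greedy
# longest-match segmentation in both directions, keeping the candidate
# whose pieces have the higher total corpus frequency.

def greedy_forward(word, vocab):
    """Longest-match segmentation scanning left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            return None  # unsegmentable with this vocabulary
    return tokens

def greedy_backward(word, vocab):
    """Longest-match segmentation scanning right to left."""
    tokens, j = [], len(word)
    while j > 0:
        for i in range(0, j):  # smallest i first = longest piece first
            if word[i:j] in vocab:
                tokens.insert(0, word[i:j])
                j = i
                break
        else:
            return None
    return tokens

def bidirectional_tokenize(word, vocab):
    """Keep the direction whose tokens are more frequent overall."""
    candidates = [t for t in (greedy_forward(word, vocab),
                              greedy_backward(word, vocab)) if t]
    if not candidates:
        return None
    return max(candidates, key=lambda ts: sum(vocab[t] for t in ts))

# Hypothetical piece -> frequency counts for the verb form 먹었다 ("ate")
vocab = {"먹": 50, "었": 30, "다": 80, "먹었": 10, "었다": 90}
print(bidirectional_tokenize("먹었다", vocab))  # → ['먹', '었다']
```

Here the forward pass produces `['먹었', '다']` (frequency 90) while the backward pass produces `['먹', '었다']` (frequency 140), so the backward segmentation wins, which matches the morpheme boundary between the stem and the ending.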
Core Capabilities
- Advanced Korean text tokenization using BidirectionalWordPiece
- High performance on sentiment analysis (89.38% accuracy on NSMC)
- Efficient handling of Korean-specific linguistic features
- Support for both character and sub-character level processing
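Sub-character processing means decomposing each Hangul syllable into its jamo (initial consonant, vowel, optional final consonant). A minimal sketch of that decomposition, using the standard Unicode arithmetic for the precomposed syllable block (U+AC00..U+D7A3); this shows the general technique, not KR-BERT's exact implementation:

```python
# Decompose precomposed Hangul syllables into jamo (sub-characters)
# using standard Unicode block arithmetic.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 initials
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none

def to_jamo(text):
    """Split Hangul syllables into jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code <= 11171:  # inside the precomposed syllable block
            out.append(CHOSEONG[code // 588])
            out.append(JUNGSEONG[(code % 588) // 28])
            if code % 28:       # final consonant is optional
                out.append(JONGSEONG[code % 28])
        else:
            out.append(ch)
    return "".join(out)

print(to_jamo("한국어"))  # → "ㅎㅏㄴㄱㅜㄱㅇㅓ"
```

Sub-character input lets the model share parameters across syllables that contain the same jamo, which is useful given Korean's rich inflectional morphology.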
Frequently Asked Questions
Q: What makes this model unique?
KR-BERT's unique BidirectionalWordPiece tokenization and specialized Korean language focus make it more efficient than multilingual alternatives while maintaining high performance. Its dual character/sub-character support provides flexibility in handling Korean text processing tasks.
Q: What are the recommended use cases?
The model excels in Korean sentiment analysis, demonstrated by its strong performance on the NSMC dataset. It's particularly suitable for tasks requiring deep understanding of Korean language structure, including text classification, sentiment analysis, and general Korean language processing tasks.