KR-BERT-char16424

Maintained By
snunlp

  • Parameters: 99,265,066
  • Vocabulary Size: 16,424
  • Paper: KR-BERT: A Small-Scale Korean-Specific Language Model
  • Training Data: 2.47GB (20M sentences, 233M words)
  • MLM Accuracy: 0.779

What is KR-BERT-char16424?

KR-BERT-char16424 is a specialized Korean language model developed by Seoul National University's Computational Linguistics Lab. It's designed to be more efficient than multilingual BERT while maintaining high performance on Korean language tasks. The model features a character-based approach with BidirectionalWordPiece tokenization, specifically optimized for Korean language structure.

Implementation Details

The model implements a novel BidirectionalWordPiece tokenization strategy that runs WordPiece matching over each word in both the forward and backward directions and keeps the segmentation built from higher-frequency subwords. This approach has shown superior performance in handling Korean-specific linguistic features, achieving a masked language modeling accuracy of 0.779, outperforming KoBERT (0.750).

  • Supports both character and sub-character tokenization modes
  • Implements efficient vocabulary size of 16,424 tokens
  • Trained on 2.47GB of Korean text data
  • Compatible with both PyTorch and TensorFlow frameworks
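
The bidirectional matching idea can be sketched with a toy example. Everything below (the vocabulary, its frequencies, and the function names) is illustrative only, not the model's actual tokenizer or data:

```python
# Toy sketch of BidirectionalWordPiece-style tokenization: segment a word by
# greedy longest-match both left-to-right and right-to-left, then keep the
# segmentation whose subwords have the higher total corpus frequency.

# Hypothetical subword vocabulary with made-up frequencies.
VOCAB = {"low": 30, "est": 80, "lowe": 5, "st": 2}


def greedy_forward(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # longest match first
            if word[i:j] in vocab or j - i == 1:   # single chars always allowed
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def greedy_backward(word, vocab):
    pieces, j = [], len(word)
    while j > 0:
        for i in range(0, j):                      # longest match first, from the right
            if word[i:j] in vocab or j - i == 1:
                pieces.append(word[i:j])
                j = i
                break
    return pieces[::-1]


def bidirectional_tokenize(word, vocab):
    fwd = greedy_forward(word, vocab)
    bwd = greedy_backward(word, vocab)
    score = lambda ps: sum(vocab.get(p, 1) for p in ps)  # unseen chars count as 1
    return fwd if score(fwd) >= score(bwd) else bwd


# Forward matching greedily picks "lowe" + "st" (score 5 + 2); backward matching
# finds "low" + "est" (score 30 + 80), so the backward segmentation wins.
print(bidirectional_tokenize("lowest", VOCAB))  # ['low', 'est']
```

The same frequency-based tie-breaking is what lets the tokenizer prefer linguistically natural Korean subwords over whatever a single-direction greedy pass happens to produce first.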

Core Capabilities

  • Advanced Korean text tokenization using BidirectionalWordPiece
  • High performance on sentiment analysis (89.38% accuracy on NSMC)
  • Efficient handling of Korean-specific linguistic features
  • Support for both character and sub-character level processing
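
Sub-character processing means decomposing each Hangul syllable block into its constituent jamo (leading consonant, vowel, optional trailing consonant). Unicode NFD normalization gives a minimal illustration of the idea; this sketch is not the model's actual preprocessing pipeline:

```python
import unicodedata


def to_subchars(text: str) -> list[str]:
    # NFD normalization splits each precomposed Hangul syllable into its
    # conjoining jamo, which is the granularity sub-character models operate on.
    return list(unicodedata.normalize("NFD", text))


# "한" is one token at the character level...
assert len("한") == 1
# ...but three jamo (ㅎ + ㅏ + ㄴ) at the sub-character level.
print(to_subchars("한"))  # ['ᄒ', 'ᅡ', 'ᆫ']
```

Operating below the syllable level lets a small vocabulary cover the combinatorially large space of Korean syllable blocks.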

Frequently Asked Questions

Q: What makes this model unique?

KR-BERT's unique BidirectionalWordPiece tokenization and specialized Korean language focus make it more efficient than multilingual alternatives while maintaining high performance. Its dual character/sub-character support provides flexibility in handling Korean text processing tasks.

Q: What are the recommended use cases?

The model excels in Korean sentiment analysis, demonstrated by its strong performance on the NSMC dataset. It's particularly suitable for tasks requiring deep understanding of Korean language structure, including text classification, sentiment analysis, and general Korean language processing tasks.
