GuwenBERT Base Model
| Property | Value |
|---|---|
| License | Apache-2.0 |
| Architecture | RoBERTa |
| Training Data | 1.7B Classical Chinese characters |
| Vocabulary Size | 23,292 tokens |
What is guwenbert-base?
GuwenBERT is a language model specialized for Classical Chinese text processing. Built on the RoBERTa architecture, it was pre-trained on the Daizhige dataset, a corpus of 15,694 ancient Chinese books covering diverse topics including Buddhism, Confucianism, Medicine, and History.
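Since the model follows the standard RoBERTa masked-language-model interface, it can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch; the model id `ethanyt/guwenbert-base` is an assumption about the published checkpoint name, so substitute whichever checkpoint you actually use.

```python
# Minimal loading sketch for GuwenBERT via Hugging Face transformers.
# "ethanyt/guwenbert-base" is an assumed model id, not confirmed by this page.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModelForMaskedLM.from_pretrained("ethanyt/guwenbert-base")
```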
Implementation Details
The model employs a two-step training strategy, initialized from the chinese-roberta-wwm-ext base: in the first step only the word embeddings are updated, and in the second step all parameters are trained. Training was conducted on 4 V100 GPUs for 120K steps, using a batch size of 2,048 and a sequence length of 512, with the Adam optimizer, a 2e-4 learning rate, and 5K warmup steps. A minimal sketch of this two-step setup appears after the feature list below.
- Comprehensive vocabulary of 23,292 tokens
- Trained on 1.7B classical Chinese characters
- 76% of training data is punctuated
- Traditional characters are converted to simplified characters during pre-processing
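The two-step strategy can be approximated with standard parameter freezing in PyTorch. The sketch below is illustrative only, not the authors' training script: the base checkpoint id `hfl/chinese-roberta-wwm-ext` is assumed, and the actual step counts and data pipeline are omitted.

```python
# Illustrative sketch of the two-step strategy:
# step 1 updates only the word embeddings, step 2 unfreezes everything.
# The checkpoint id is an assumption; adjust to the weights you actually use.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Step 1: freeze all parameters except the input word embeddings.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# ... run masked-LM pre-training on the classical corpus for the first stage ...

# Step 2: unfreeze every parameter and continue pre-training.
for param in model.parameters():
    param.requires_grad = True
```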
Core Capabilities
- Named entity recognition for ancient texts
- Fill-mask prediction for Classical Chinese (see the example after this list)
- Sentence segmentation and punctuation restoration
- 84.63% F1 score on ancient-text NER tasks
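As a quick illustration of the fill-mask capability, the following sketch uses the transformers pipeline. The model id `ethanyt/guwenbert-base` is assumed, and the input sentence (from 桃花源记) is purely an example.

```python
# Hedged example: fill-mask inference with the transformers pipeline.
# The model id is an assumption; the sentence is only for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

text = f"晋太元中，武陵人捕鱼为业。缘溪{fill_mask.tokenizer.mask_token}，忘路之远近。"
for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```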
Frequently Asked Questions
Q: What makes this model unique?
GuwenBERT's distinctive feature is its specialization in Classical Chinese text processing, trained on an extensive corpus of ancient texts spanning multiple domains. Its two-step training strategy, starting from a modern-Chinese RoBERTa, helps the model adapt to classical text while retaining general language knowledge.
Q: What are the recommended use cases?
The model excels in tasks related to Classical Chinese text processing, including named entity recognition (particularly for book names and other proper nouns), text completion, and automated punctuation. It's particularly valuable for digital humanities and historical text analysis.
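For these downstream tasks, a standard transformers fine-tuning setup can be used on top of the pre-trained encoder. The sketch below shows token classification (e.g., NER, or punctuation restoration framed as per-character tagging); the model id and the label set are assumptions for illustration, not part of the released model.

```python
# Hedged sketch: fine-tuning GuwenBERT for token classification.
# The model id and the label list are hypothetical, shown only to
# illustrate the setup; supply your own labeled data and training loop.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-BOOK", "I-BOOK"]  # hypothetical tag set for book-name NER

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, train with the Trainer API or a plain PyTorch loop.
```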