GuwenBERT Base Model
| Property | Value |
|---|---|
| License | Apache-2.0 |
| Architecture | RoBERTa |
| Training Data | 1.7B Classical Chinese characters |
| Vocabulary Size | 23,292 tokens |
What is guwenbert-base?
GuwenBERT is a language model specialized for Classical Chinese text processing. Built on the RoBERTa architecture, it was pre-trained on the Daizhige dataset, a corpus of 15,694 ancient Chinese books covering diverse topics including Buddhism, Confucianism, Medicine, and History.
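Since the model follows the standard RoBERTa masked-language-model interface, it can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch; the model id `ethanyt/guwenbert-base` is an assumption about the published checkpoint name, so substitute whichever checkpoint you actually use.

```python
# Minimal loading sketch for GuwenBERT via Hugging Face transformers.
# "ethanyt/guwenbert-base" is an assumed model id, not confirmed by this page.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModelForMaskedLM.from_pretrained("ethanyt/guwenbert-base")
```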
Implementation Details
The model employs a two-step training strategy, initialized from the chinese-roberta-wwm-ext base: in the first step only the word embeddings are updated, and in the second step all parameters are trained. Training was conducted on 4 V100 GPUs for 120K steps, using a batch size of 2,048 and a sequence length of 512, with the Adam optimizer, a 2e-4 learning rate, and 5K warmup steps. A minimal sketch of this two-step setup appears after the feature list below.
- Comprehensive vocabulary of 23,292 tokens
- Trained on 1.7B classical Chinese characters
- 76% of training data is punctuated
- Traditional characters are converted to simplified characters during pre-processing
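The two-step strategy can be approximated with standard parameter freezing in PyTorch. The sketch below is illustrative only, not the authors' training script: the base checkpoint id `hfl/chinese-roberta-wwm-ext` is assumed, and the actual step counts and data pipeline are omitted.

```python
# Illustrative sketch of the two-step strategy:
# step 1 updates only the word embeddings, step 2 unfreezes everything.
# The checkpoint id is an assumption; adjust to the weights you actually use.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Step 1: freeze all parameters except the input word embeddings.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# ... run masked-LM pre-training on the classical corpus for the first stage ...

# Step 2: unfreeze every parameter and continue pre-training.
for param in model.parameters():
    param.requires_grad = True
```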
Core Capabilities
- Named entity recognition for ancient texts
- Fill-mask prediction for Classical Chinese (see the example after this list)
- Sentence segmentation and punctuation restoration
- 84.63% F1 score on ancient-text NER tasks
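As a quick illustration of the fill-mask capability, the following sketch uses the transformers pipeline. The model id `ethanyt/guwenbert-base` is assumed, and the input sentence (from 桃花源记) is purely an example.

```python
# Hedged example: fill-mask inference with the transformers pipeline.
# The model id is an assumption; the sentence is only for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

text = f"晋太元中，武陵人捕鱼为业。缘溪{fill_mask.tokenizer.mask_token}，忘路之远近。"
for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```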
Frequently Asked Questions
Q: What makes this model unique?
GuwenBERT's distinctive feature is its specialization in Classical Chinese text processing, trained on an extensive corpus of ancient texts spanning multiple domains. Its two-step training strategy, starting from a modern-Chinese RoBERTa, helps the model adapt to classical text while retaining general language knowledge.
Q: What are the recommended use cases?
The model excels in tasks related to Classical Chinese text processing, including named entity recognition (particularly for book names and other proper nouns), text completion, and automated punctuation. It's particularly valuable for digital humanities and historical text analysis.
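For these downstream tasks, a standard transformers fine-tuning setup can be used on top of the pre-trained encoder. The sketch below shows token classification (e.g., NER, or punctuation restoration framed as per-character tagging); the model id and the label set are assumptions for illustration, not part of the released model.

```python
# Hedged sketch: fine-tuning GuwenBERT for token classification.
# The model id and the label list are hypothetical, shown only to
# illustrate the setup; supply your own labeled data and training loop.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-BOOK", "I-BOOK"]  # hypothetical tag set for book-name NER

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, train with the Trainer API or a plain PyTorch loop.
```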