GuwenBERT-large

Property	Value
License	Apache 2.0
Framework	PyTorch
Task Type	Fill-Mask
Training Data	1.7B Classical Chinese Characters

What is GuwenBERT-large?

GuwenBERT-large is a RoBERTa-based language model built for Classical Chinese text. It was developed by researchers at Datahammer and Beijing Institute of Technology and pre-trained on a corpus of 15,694 classical books (about 1.7B characters) spanning domains such as Buddhism, Confucianism, medicine, and history.

Implementation Details

The model is initialized from hfl/chinese-roberta-wwm-ext-large and then pre-trained with a two-step strategy. Training ran on 4 V100 GPUs for 120K steps, with a batch size of 2,048 and a sequence length of 512, using the Adam optimizer. The vocabulary contains 23,292 characters.
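As a rough sanity check on these figures, the training budget can be worked out with simple arithmetic. The sketch below assumes every sequence is filled to the full 512 positions and that one token corresponds to roughly one character (plausible for a character-level vocabulary, but an assumption here), so the result is an upper bound rather than a reported number.

```python
# Figures quoted in the paragraph above.
STEPS = 120_000
BATCH_SIZE = 2_048
SEQ_LEN = 512
CORPUS_CHARS = 1_700_000_000  # "1.7B" characters of training data


def training_budget(steps, batch_size, seq_len, corpus_chars):
    """Return (token positions processed, approximate passes over the corpus).

    Assumes fully packed sequences and about one token per character.
    """
    positions = steps * batch_size * seq_len
    return positions, positions / corpus_chars


positions, passes = training_budget(STEPS, BATCH_SIZE, SEQ_LEN, CORPUS_CHARS)
print(f"{positions:,} token positions, about {passes:.0f} passes over the corpus")
```

Under those assumptions the run covers roughly 126 billion token positions, on the order of 74 passes over the corpus.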

Trained on 1.7B characters from classical Chinese texts
Two-phase training approach with selective parameter updating
Optimized for classical Chinese language understanding
Traditional characters in the training corpus converted to simplified characters
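The "selective parameter updating" idea can be sketched in PyTorch by toggling `requires_grad` per training phase. The text here does not say which parameters are updated in each phase; the sketch assumes that phase 1 trains only the word embeddings and phase 2 trains everything. The names `EMBEDDING_KEY`, `is_trainable`, and `apply_phase` are hypothetical helpers for illustration, and the key relies on the parameter naming used by Hugging Face RoBERTa models.

```python
# Assumption: word-embedding parameters contain this substring in their name,
# as in Hugging Face RoBERTa ("roberta.embeddings.word_embeddings.weight").
EMBEDDING_KEY = "word_embeddings"


def is_trainable(param_name, phase):
    """Decide whether a parameter is updated in the given phase.

    Phase 1 (assumed): only word embeddings are updated.
    Phase 2 (assumed): all parameters are updated.
    """
    if phase == 1:
        return EMBEDDING_KEY in param_name
    if phase == 2:
        return True
    raise ValueError("phase must be 1 or 2")


def apply_phase(model, phase):
    """Freeze or unfreeze parameters of a torch.nn.Module for a phase.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, phase)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

In a training script, `apply_phase(model, 1)` would be called before the first stage and `apply_phase(model, 2)` before the second, with the optimizer built (or rebuilt) over the trainable parameters after each call.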

Core Capabilities

Named entity recognition (84.63% F1 score)
Book name recognition (75.57% F1 score)
Other named entity recognition (87.55% F1 score)
Sentence breaking and punctuation
Classical Chinese text analysis

Frequently Asked Questions

Q: What makes this model unique?

GuwenBERT-large is distinguished by its exclusive focus on Classical Chinese and by training data that spans multiple historical domains. Its two-step training strategy and 23,292-character vocabulary are intended to make it effective for analyzing ancient Chinese text.

Q: What are the recommended use cases?

The model excels in tasks such as named entity recognition in classical texts, sentence breaking, punctuation restoration, and general classical Chinese text understanding. It's particularly useful for digital humanities projects, historical text analysis, and classical Chinese literature research.
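Before fine-tuning for any of these tasks, the pretrained checkpoint can be probed through the fill-mask task listed in the property table. The following is a minimal sketch using the Hugging Face `transformers` pipeline. The Hub identifier `ethanyt/guwenbert-large` is an assumption and should be checked against the published model card; `mask_char` and `predict_char` are hypothetical helper names.

```python
# Assumption: Hub id of the checkpoint; verify against the model card.
MODEL_ID = "ethanyt/guwenbert-large"


def mask_char(text, index, mask_token="[MASK]"):
    """Replace the character at `index` with the mask token."""
    if not 0 <= index < len(text):
        raise IndexError("index out of range")
    return text[:index] + mask_token + text[index + 1:]


def predict_char(text, index, top_k=5):
    """Return (candidate, score) pairs for the character at `index`.

    Downloads the checkpoint on first use and requires `transformers`
    and PyTorch to be installed.
    """
    # Imported lazily so mask_char stays usable without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token rather than assuming "[MASK]".
    masked = mask_char(text, index, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=top_k)]
```

For example, `predict_char("晋太元中，武陵人捕鱼为业。", 0)` asks the model to propose candidates for the first character of the sentence.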
