GuwenBERT-large
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Framework | PyTorch |
| Task Type | Fill-Mask |
| Training Data | 1.7B Classical Chinese characters |
What is GuwenBERT-large?
GuwenBERT-large is a RoBERTa-based language model built for Classical Chinese text. Developed by researchers at Datahammer and Beijing Institute of Technology, it was pre-trained on a corpus of 15,694 classical books spanning domains such as Buddhism, Confucianism, medicine, and history.
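Because the model is published for the fill-mask task, a quick way to try it is the Hugging Face `transformers` fill-mask pipeline. The sketch below is illustrative only: the hub identifier `ethanyt/guwenbert-large` is an assumption that this page does not state, and `mask_char` and `top_predictions` are hypothetical helper names. The mask token is read from the tokenizer rather than hard-coded.

```python
def mask_char(text: str, index: int, mask_token: str) -> str:
    """Replace the character at `index` with the tokenizer's mask token."""
    return text[:index] + mask_token + text[index + 1:]


def top_predictions(text: str, index: int,
                    model_id: str = "ethanyt/guwenbert-large",  # assumed hub ID
                    top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Requires `pip install transformers torch`; the model is downloaded
    on first use.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    fill = pipeline("fill-mask", model=model_id)
    masked = mask_char(text, index, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(masked, top_k=top_k)]
```

For example, `top_predictions("晋太元中，武陵人捕鱼为业。", 2)` would mask the third character and return the five candidates the model scores highest for that position.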
Implementation Details
The model is trained with a two-step strategy, starting from the weights of hfl/chinese-roberta-wwm-ext-large. Training ran on 4 V100 GPUs for 120K steps with a batch size of 2,048 and a sequence length of 512. The vocabulary contains 23,292 entries, and optimization uses Adam.
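The figures above imply a rough training budget, which the short calculation below works out. It assumes every sequence is filled to the full 512 positions and that one token corresponds to one character, so the number of corpus passes is an upper bound rather than a reported value.

```python
BATCH_SIZE = 2_048      # sequences per step (from the text)
SEQ_LEN = 512           # tokens per sequence (from the text)
STEPS = 120_000         # total training steps (from the text)
CORPUS_CHARS = 1.7e9    # corpus size in characters (from the table)

tokens_per_step = BATCH_SIZE * SEQ_LEN
total_tokens = tokens_per_step * STEPS
# Assumption: full-length sequences and one token per character.
approx_passes = total_tokens / CORPUS_CHARS

print(f"tokens per step : {tokens_per_step:,}")
print(f"tokens processed: {total_tokens:,}")
print(f"corpus passes   : {approx_passes:.0f} (upper bound)")
```

Under these assumptions each step covers 1,048,576 token positions, the full run covers about 126 billion, and the 1.7B-character corpus is seen at most roughly 74 times.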
- Trained on 1.7B characters from classical Chinese texts
- Two-phase training approach with selective parameter updating
- Optimized for classical Chinese language understanding
- Training text normalized from traditional to simplified characters
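One way to picture the two-phase approach with selective parameter updating is to freeze most parameters in the first phase and unfreeze them in the second. The sketch below is not the authors' training code. It assumes that phase 1 updates only the word embeddings and that parameter names follow the Hugging Face convention of containing `word_embeddings`; both are assumptions this page does not confirm. `trainable_in_phase` and `apply_phase` are hypothetical helpers.

```python
def trainable_in_phase(param_name: str, phase: int) -> bool:
    """Decide whether a parameter is updated in the given phase.

    Assumption: phase 1 updates only the word embeddings,
    phase 2 updates every parameter.
    """
    if phase == 1:
        return "word_embeddings" in param_name
    return True


def apply_phase(model, phase: int) -> int:
    """Set requires_grad on each parameter; return how many stay trainable.

    `model` is any object exposing named_parameters(),
    for example a torch.nn.Module.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = trainable_in_phase(name, phase)
        trainable += int(param.requires_grad)
    return trainable
```

With a PyTorch model, calling `apply_phase(model, 1)` before the first phase and `apply_phase(model, 2)` before the second switches between the two regimes. The optimizer should be built, or rebuilt, after each call so that it tracks the parameters that are currently trainable.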
Core Capabilities
- Named entity recognition: 84.63% F1 overall
- Book-name recognition: 75.57% F1
- Other-name recognition: 87.55% F1
- Sentence breaking and punctuation
- Classical Chinese text analysis
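The F1 figures above combine precision and recall into a single score. As a rough illustration of how such a score is computed at the entity level, the sketch below compares gold and predicted spans by exact match. Whether the reported numbers were computed with exact span matching is an assumption here, `entity_f1` is a hypothetical helper, and the spans in the usage note are invented for illustration rather than taken from the evaluation.

```python
def entity_f1(gold: set, pred: set) -> tuple:
    """Return (precision, recall, F1) for exact-match entity spans.

    Each span is a hashable tuple such as (start, end, entity_type).
    """
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With gold spans `{(0, 2, "BOOK"), (5, 7, "OTHER")}` and predicted spans `{(0, 2, "BOOK"), (8, 9, "OTHER")}`, one of two predictions is correct and one of two gold entities is found, so precision, recall, and F1 are all 0.5.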
Frequently Asked Questions
Q: What makes this model unique?
GuwenBERT-large is distinguished by its exclusive focus on Classical Chinese and by training data that spans multiple historical domains. The two-step training strategy and a 23,292-entry vocabulary make it well suited to ancient Chinese text analysis.
Q: What are the recommended use cases?
The model excels in tasks such as named entity recognition in classical texts, sentence breaking, punctuation restoration, and general classical Chinese text understanding. It's particularly useful for digital humanities projects, historical text analysis, and classical Chinese literature research.
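Sentence breaking and punctuation restoration are commonly framed as per-character sequence labeling, where each character is tagged with the punctuation mark that follows it. The sketch below prepares such labels from already punctuated text. This framing, the punctuation set, and the helper name `to_labels` are assumptions for illustration, not the model authors' documented pipeline.

```python
PUNCTUATION = set("，。？！；：、")  # assumed label set, extend as needed


def to_labels(text: str, puncts=PUNCTUATION):
    """Strip punctuation and tag each character with the mark that followed it.

    Characters not followed by punctuation get the label "O".
    If several marks follow one character, the last one wins.
    """
    chars, labels = [], []
    for ch in text:
        if ch in puncts:
            if labels:              # ignore punctuation before any character
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels
```

For `to_labels("学而时习之，不亦说乎？")`, the characters are those of `学而时习之不亦说乎` and the labels are `["O", "O", "O", "O", "，", "O", "O", "O", "？"]`. Pairs like these could serve as training targets when fine-tuning the model with a token-classification head.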