guwenbert-large

Maintained by: ethanyt

GuwenBERT-large

Property         Value
License          Apache 2.0
Framework        PyTorch
Task Type        Fill-Mask
Training Data    1.7B Classical Chinese characters

What is guwenbert-large?

GuwenBERT-large is a RoBERTa-based language model for Classical Chinese text. It was developed by researchers at Datahammer and Beijing Institute of Technology and pre-trained on a corpus of 15,694 classical books spanning domains including Buddhism, Confucianism, medicine, and history.
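Since the model's task type is fill-mask, a minimal sketch of querying it might look like the following. It assumes the checkpoint is published on the Hugging Face Hub as `ethanyt/guwenbert-large` and that the `transformers` library is installed; the helper names `mask_character`, `top_predictions`, and `demo` are hypothetical, and the sample sentence is only an illustration.

```python
def mask_character(text, index, mask_token="[MASK]"):
    """Replace the single character at `index` with the mask token."""
    if not 0 <= index < len(text):
        raise IndexError("index out of range")
    return text[:index] + mask_token + text[index + 1:]


def top_predictions(text, index, model_id="ethanyt/guwenbert-large", k=5):
    """Return the k most likely characters for the masked position.

    Assumption: `model_id` names the checkpoint on the Hugging Face Hub.
    The first call downloads the model weights.
    """
    # Imported lazily so mask_character works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding one.
    masked = mask_character(text, index, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=k)]


def demo():
    # Illustrative Classical Chinese sentence in simplified characters.
    sentence = "太元中，武陵人捕鱼为业。"
    for token, score in top_predictions(sentence, 0):
        print(token, round(score, 4))
```

Calling `demo()` prints the candidate characters for the first position together with their scores. The mask token is read from the tokenizer so the sketch does not depend on a particular mask string.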

Implementation Details

The model is trained with a two-step strategy, starting from the weights of hfl/chinese-roberta-wwm-ext-large. Training ran on 4 V100 GPUs for 120K steps with a batch size of 2,048 and a sequence length of 512. The vocabulary contains 23,292 characters, and optimization uses Adam.

  • Trained on 1.7B characters from classical Chinese texts
  • Two-phase training approach with selective parameter updating
  • Optimized for classical Chinese language understanding
  • Trained on text converted from traditional to simplified characters, so traditional-character input should be converted first
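The two-phase approach with selective parameter updating can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the authors' training code: it assumes the first phase updates only the word embeddings and the second phase updates everything, which this section does not spell out. `ToyEncoder`, `freeze_all_but_embeddings`, `unfreeze_all`, and the `"word_embeddings"` name filter are hypothetical stand-ins for the real model and its parameter names.

```python
import torch.nn as nn


def freeze_all_but_embeddings(model, embedding_key="word_embeddings"):
    """Phase 1 (assumed): train only parameters whose name contains `embedding_key`."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = embedding_key in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


def unfreeze_all(model):
    """Phase 2 (assumed): make every parameter trainable again."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = True
        trainable.append(name)
    return trainable


class ToyEncoder(nn.Module):
    """Tiny stand-in for the real encoder, used only to show the mechanics."""

    def __init__(self, vocab_size=23292, hidden=8):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden)
        self.dense = nn.Linear(hidden, hidden)

    def forward(self, input_ids):
        return self.dense(self.word_embeddings(input_ids))


toy = ToyEncoder()
step1_trainable = freeze_all_but_embeddings(toy)  # only the embedding matrix
step2_trainable = unfreeze_all(toy)               # embeddings plus dense layer
print(step1_trainable)
print(step2_trainable)
```

When applying this to a real training loop, the optimizer would typically be built from the parameters with `requires_grad=True` at the start of each phase, so frozen weights receive no updates.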

Core Capabilities

  • Named entity recognition (84.63% F1 score)
  • Book name recognition (75.57% F1 score)
  • Other named entity recognition (87.55% F1 score)
  • Sentence breaking and punctuation
  • Classical Chinese text analysis
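For readers unfamiliar with how F1 figures like those above are derived, here is a minimal sketch of entity-level F1. It assumes exact-match scoring on (start, end, label) spans; the evaluation protocol behind the reported numbers is not described here, and the spans below are invented purely for illustration.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, f1) for two sets of (start, end, label) spans.

    Assumption: a prediction counts as correct only if span and label both match.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, not data from the model's evaluation.
gold = {(0, 2, "BOOK"), (5, 7, "OTHER"), (9, 11, "OTHER"), (14, 16, "OTHER")}
pred = {(0, 2, "BOOK"), (5, 7, "OTHER"), (9, 12, "OTHER")}

precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of three predictions are correct and two of four gold entities are found, so the third prediction, which overshoots its span by one character, counts as a miss under exact matching.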

Frequently Asked Questions

Q: What makes this model unique?

GuwenBERT-large is distinguished by its focus on Classical Chinese and by training data that spans multiple historical domains. Its two-step training strategy and 23,292-character vocabulary make it well suited to ancient Chinese text analysis.

Q: What are the recommended use cases?

The model is suited to named entity recognition in classical texts, sentence breaking, punctuation restoration, and general Classical Chinese text understanding. It is particularly useful for digital humanities projects, historical text analysis, and Classical Chinese literature research.
