guwenbert-base

Maintained By
ethanyt

GuwenBERT Base Model

License: Apache-2.0
Architecture: RoBERTa
Training Data: 1.7B Classical Chinese characters
Vocabulary Size: 23,292 tokens

What is guwenbert-base?

GuwenBERT is a language model specialized in processing Classical Chinese texts. Built on the RoBERTa architecture, it was pre-trained on the Daizhige dataset, comprising 15,694 ancient Chinese books covering diverse topics including Buddhism, Confucianism, Medicine, and History.
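As a quick orientation, the sketch below loads the encoder and tokenizer with the Hugging Face transformers library; it assumes the checkpoint published on the Hub as ethanyt/guwenbert-base and PyTorch as the backend.

```python
from transformers import AutoTokenizer, AutoModel

# Sketch: load the pre-trained GuwenBERT encoder and its tokenizer.
# Assumes the checkpoint published on the Hugging Face Hub as
# "ethanyt/guwenbert-base"; point the id at a local copy if needed.
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModel.from_pretrained("ethanyt/guwenbert-base")

# Encode a short Classical Chinese sentence (simplified characters, matching
# the training corpus) and obtain contextual embeddings.
inputs = tokenizer("有朋自远方来,不亦乐乎", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```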

Implementation Details

The model employs a two-step training strategy and is initialized from the chinese-roberta-wwm-ext base. Training ran on 4 V100 GPUs for 120K steps with a batch size of 2,048 and a sequence length of 512, using the Adam optimizer with a 2e-4 learning rate and 5K warmup steps.

  • Vocabulary of 23,292 tokens (checked in the sketch after this list)
  • Trained on 1.7B classical Chinese characters
  • 76% of the training data is punctuated
  • Traditional characters converted to simplified during corpus preprocessing
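The quoted vocabulary size and the simplified-script tokenization can be inspected directly from the published tokenizer; the sketch below again assumes the ethanyt/guwenbert-base Hub id.

```python
from transformers import AutoTokenizer

# Sketch: inspect the published tokenizer (assumed Hub id
# "ethanyt/guwenbert-base") against the figures listed above.
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
print(tokenizer.vocab_size)  # expected to match the ~23,292-token vocabulary

# The corpus was converted to simplified characters before training, so
# simplified input is assumed; convert traditional texts beforehand
# (e.g. with a tool such as OpenCC).
print(tokenizer.tokenize("学而不思则罔"))
```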

Core Capabilities

  • Named Entity Recognition for ancient texts
  • Fill-mask prediction for Classical Chinese (see the example below)
  • Sentence segmentation and automatic punctuation
  • Achieves an 84.63% F1 score on NER tasks
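Mask filling can be exercised directly through the transformers pipeline API. The snippet below is a sketch that assumes the ethanyt/guwenbert-base checkpoint and reads the mask token from the tokenizer rather than hard-coding it.

```python
from transformers import pipeline

# Sketch of fill-mask inference with GuwenBERT (assumed Hub id
# "ethanyt/guwenbert-base").
fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

# Use the tokenizer's own mask token so the example does not depend on
# whether the checkpoint defines it as "[MASK]" or "<mask>".
mask = fill_mask.tokenizer.mask_token

# A Classical Chinese sentence (simplified characters) with one character masked.
sentence = f"学而时习之,不亦{mask}乎"

for prediction in fill_mask(sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```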

Frequently Asked Questions

Q: What makes this model unique?

GuwenBERT's distinctive feature is its specialization in Classical Chinese text processing, trained on an extensive corpus of ancient texts spanning multiple domains. Its two-step training strategy, initialized from a modern-Chinese RoBERTa checkpoint, is intended to carry general Chinese language knowledge over to classical texts.

Q: What are the recommended use cases?

The model excels in tasks related to Classical Chinese text processing, including named entity recognition (particularly for book names and other proper nouns), text completion, and automated punctuation. It's particularly valuable for digital humanities and historical text analysis.
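The released checkpoint is an encoder without task heads, so NER or punctuation models have to be fine-tuned on labeled data. The sketch below shows one plausible starting point with AutoModelForTokenClassification; the label set is a hypothetical example, not something shipped with this model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for tagging book names in classical texts;
# the real labels depend on your annotated corpus, not on this model card.
labels = ["O", "B-BOOK", "I-BOOK"]

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The token-classification head is freshly initialized, so the model must be
# fine-tuned (e.g. with the standard transformers Trainer) on token-labeled
# data before it produces useful NER predictions.
```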
