guwenbert-large

ethanyt

GuwenBERT-large is a RoBERTa model specialized for Classical Chinese text processing, trained on 1.7B characters from ancient texts. It reaches an 84.63% F1 score on named entity recognition (NER).

Property         Value
License          Apache 2.0
Framework        PyTorch
Task Type        Fill-Mask
Training Data    1.7B Classical Chinese characters
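
The Fill-Mask task type means the model predicts a hidden character from its surrounding context. A minimal sketch of how that might look with the Hugging Face `transformers` pipeline follows. The Hub identifier `ethanyt/guwenbert-large` is an assumption pieced together from the author and model names on this page, and `mask_char` and `predict_masked` are hypothetical helper names, not part of any library.

```python
def mask_char(text: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the character at `index` with the mask token."""
    if not 0 <= index < len(text):
        raise IndexError("index is outside the text")
    return text[:index] + mask_token + text[index + 1:]


def predict_masked(text: str, index: int,
                   model_id: str = "ethanyt/guwenbert-large"):
    """Return (candidate, score) pairs for one masked character.

    Assumes `transformers` and PyTorch are installed and that `model_id`
    (an assumed Hub identifier) can be downloaded.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding one.
    masked = mask_char(text, index, fill.tokenizer.mask_token)
    return [(c["token_str"], c["score"]) for c in fill(masked)]


# Example call (downloads the model on first use):
# predict_masked("晋太元中，武陵人捕鱼为业。", 0)
```

Reading the mask token from the tokenizer keeps the sketch independent of whichever mask string the model's vocabulary actually defines.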

What is guwenbert-large?

GuwenBERT-large is a RoBERTa-based language model designed specifically for processing Classical Chinese texts. Developed by researchers at Datahammer and Beijing Institute of Technology, it was trained on a corpus of 15,694 classical books spanning domains such as Buddhism, Confucianism, medicine, and history.

Implementation Details

The model is trained with a two-step strategy, initialized from hfl/chinese-roberta-wwm-ext-large. Training ran on 4 V100 GPUs for 120K steps with a batch size of 2,048 and a sequence length of 512. The model has a vocabulary size of 23,292 and was optimized with Adam.
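
To put those figures in proportion, the short calculation below estimates how much text the run processed. It is a rough upper bound rather than a reported statistic: it assumes every sequence is filled to the full 512 positions and that one token corresponds to roughly one character.

```python
STEPS = 120_000        # training steps, from the text
BATCH_SIZE = 2_048     # sequences per step, from the text
SEQ_LEN = 512          # token positions per sequence, from the text
CORPUS_CHARS = 1.7e9   # corpus size in characters, from the text

tokens_per_step = BATCH_SIZE * SEQ_LEN
total_tokens = tokens_per_step * STEPS
# Assumption: full sequences, and one token is roughly one character.
approx_passes = total_tokens / CORPUS_CHARS

print(f"token positions per step: {tokens_per_step:,}")
print(f"token positions over training: {total_tokens:,}")
print(f"approximate passes over the corpus: {approx_passes:.0f}")
```

Under these assumptions the run covers about 1.05M token positions per step and roughly 126B in total, on the order of 74 passes over the 1.7B-character corpus.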

  • Trained on 1.7B characters from classical Chinese texts
  • Two-phase training approach with selective parameter updating
  • Optimized for classical Chinese language understanding
  • Traditional characters converted to simplified characters in the training text
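
The selective parameter updating mentioned in the list can be sketched as toggling which parameters receive gradients in each phase. This page does not say which parameters are updated when, so the sketch assumes that phase 1 trains only the word embeddings and phase 2 trains everything. `set_trainable` is a hypothetical helper, and the parameter names in the demo are illustrative stand-ins.

```python
from types import SimpleNamespace


def set_trainable(named_parameters, phase, embedding_key="word_embeddings"):
    """Toggle `requires_grad` for a training phase; return trainable names.

    Assumption: phase 1 updates only parameters whose name contains
    `embedding_key`; phase 2 updates all parameters.
    """
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = phase == 2 or embedding_key in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without PyTorch installed.
demo = [
    (name, SimpleNamespace(requires_grad=True))
    for name in (
        "roberta.embeddings.word_embeddings.weight",
        "roberta.encoder.layer.0.attention.self.query.weight",
    )
]
print(set_trainable(demo, phase=1))  # only the embedding parameter
print(set_trainable(demo, phase=2))  # both parameters
```

With a PyTorch module, the same function would be called as `set_trainable(model.named_parameters(), phase=1)` before building the optimizer for that phase.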

Core Capabilities

  • Named entity recognition (84.63% F1 score)
  • Book name recognition (75.57% F1 score)
  • Other named entity recognition (87.55% F1 score)
  • Sentence breaking and punctuation
  • Classical Chinese text analysis
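
For readers unfamiliar with the metric, an F1 score of the kind listed above combines precision and recall over predicted entities. The sketch below computes an exact-match, entity-level F1. The evaluation protocol behind the reported numbers is not described on this page, so this is a common convention rather than the authors' scoring script, and the spans in the example are made up.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, f1) over exact-match entity spans.

    Each entity is a hashable tuple such as (start, end, label).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical spans: two of three predictions match the gold annotations.
gold = {(0, 2, "BOOK"), (5, 7, "OTHER"), (9, 11, "OTHER")}
predicted = {(0, 2, "BOOK"), (5, 7, "OTHER"), (12, 14, "OTHER")}
precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```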

Frequently Asked Questions

Q: What makes this model unique?

GuwenBERT-large is distinguished by its focus on Classical Chinese and by training data that covers multiple historical domains. Its two-step training strategy and 23,292-entry vocabulary make it particularly effective for ancient Chinese text analysis.

Q: What are the recommended use cases?

The model excels in tasks such as named entity recognition in classical texts, sentence breaking, punctuation restoration, and general classical Chinese text understanding. It's particularly useful for digital humanities projects, historical text analysis, and classical Chinese literature research.
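
Punctuation restoration is often framed as labeling each character with the mark, if any, that should follow it. The sketch below shows only the final step of turning such labels back into punctuated text. The label scheme (`"O"` for no mark, otherwise the mark itself) is a hypothetical convention, not this model's documented output format, and the labels in the example are written by hand rather than predicted.

```python
def apply_punctuation(chars, labels, no_mark="O"):
    """Insert each non-`no_mark` label after its corresponding character."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    pieces = []
    for char, label in zip(chars, labels):
        pieces.append(char)
        if label != no_mark:
            pieces.append(label)
    return "".join(pieces)


# Unpunctuated input with hand-written labels standing in for predictions.
text = "学而时习之不亦说乎"
labels = ["O", "O", "O", "O", "，", "O", "O", "O", "？"]
restored = apply_punctuation(text, labels)
print(restored)
```

In a real pipeline the labels would come from a token-classification head fine-tuned on top of the model.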
