bert-base-historic-multilingual-cased

Property	Value
Author	dbmdz
Training Data Size	130GB
Languages Supported	German, French, English, Finnish, Swedish
Vocabulary Size	32k tokens

What is bert-base-historic-multilingual-cased?

hmBERT (bert-base-historic-multilingual-cased) is a BERT model for processing historical multilingual text. It was trained on a 130GB corpus of historical documents in five European languages (German, French, English, Finnish, and Swedish), sourced primarily from Europeana and the British Library. The model is aimed in particular at Named Entity Recognition (NER) in historical documents.

Implementation Details

The model was trained on a v3-32 TPU provided by Google's TPU Research Cloud, with a batch size of 128 and a learning rate of 1e-4, for 3,000,000 steps. To maintain quality, the training data was filtered per language, by OCR confidence score or by year, and the much smaller Finnish and Swedish corpora were upsampled to a size comparable to the others. The resulting corpora are:

German corpus: 28GB (OCR confidence > 0.60)
French corpus: 27GB (OCR confidence > 0.70)
English corpus: 24GB (filtered for years 1800-1900)
Finnish corpus: 27GB (upsampled from 1.2GB)
Swedish corpus: 27GB (upsampled from 1.1GB)
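
The filtering and upsampling described above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing pipeline: the `Page` record, its `ocr_confidence` field, the language codes, and both helper functions are hypothetical names chosen for illustration. Only the thresholds (0.60 for German, 0.70 for French) and the corpus sizes come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical record for one OCR'd page (not the authors' data format)."""
    text: str
    ocr_confidence: float  # assumed to lie in [0, 1]


# Thresholds taken from the corpus list above; "de"/"fr" keys are illustrative.
OCR_THRESHOLDS = {"de": 0.60, "fr": 0.70}


def filter_by_confidence(pages: List[Page], language: str) -> List[Page]:
    """Keep only pages whose OCR confidence is strictly above the threshold."""
    threshold = OCR_THRESHOLDS[language]
    return [page for page in pages if page.ocr_confidence > threshold]


def upsampling_factor(original_gb: float, target_gb: float) -> float:
    """How many times a small corpus is effectively repeated to reach the target size."""
    return target_gb / original_gb


pages = [Page("gut", 0.91), Page("schlecht", 0.42), Page("grenzfall", 0.60)]
kept = filter_by_confidence(pages, "de")
print([page.text for page in kept])          # ['gut']
print(round(upsampling_factor(1.2, 27), 1))  # Finnish: 22.5
print(round(upsampling_factor(1.1, 27), 1))  # Swedish: 24.5
```

The last two lines show that the Finnish and Swedish corpora are each repeated more than twenty times in effect, which is worth keeping in mind when judging how much distinct text the model saw in those languages.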

Core Capabilities

Multilingual processing of historical texts
Named Entity Recognition for historical documents
Low unknown token rates across all supported languages
Subword tokenization with fertility rates (average number of subwords per word) between 1.16 and 1.69
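
To make the two tokenizer metrics above concrete, here is a small sketch of how subword fertility and the unknown token rate are commonly computed. The toy tokenizer and its splits are invented for illustration and do not reflect the model's actual 32k vocabulary. The function names are likewise assumptions, and the unknown rate is taken here to be the share of `[UNK]` pieces among all subwords.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def subword_fertility(words: List[str], tokenize: Tokenizer) -> float:
    """Average number of subword pieces produced per input word."""
    total_subwords = sum(len(tokenize(word)) for word in words)
    return total_subwords / len(words)


def unknown_rate(words: List[str], tokenize: Tokenizer, unk_token: str = "[UNK]") -> float:
    """Share of subword pieces that are the unknown token."""
    subwords = [piece for word in words for piece in tokenize(word)]
    return sum(piece == unk_token for piece in subwords) / len(subwords)


# Toy stand-in for a WordPiece tokenizer (illustrative splits only).
TOY_SPLITS = {
    "der": ["der"],
    "zeitungen": ["zeitung", "##en"],
    "stadt": ["stadt"],
}


def toy_tokenize(word: str) -> List[str]:
    # Words missing from the toy table map to a single unknown token.
    return TOY_SPLITS.get(word, ["[UNK]"])


words = ["der", "zeitungen", "stadt", "ſtadt"]
print(subword_fertility(words, toy_tokenize))  # 1.25
print(unknown_rate(words, toy_tokenize))       # 0.2
```

A fertility of 1.0 would mean every word stays whole. Values closer to 1 generally indicate that the vocabulary fits the text well, which matters for OCR'd historical spellings such as the long s in the last toy word.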

Frequently Asked Questions

Q: What makes this model unique?

This model is designed for historical text analysis: its training data comes from historical sources and was filtered by OCR confidence to maintain quality. It is one of the few models targeting historical NER across multiple European languages.

Q: What are the recommended use cases?

The model is best suited to processing historical documents from the 19th century, particularly for tasks such as Named Entity Recognition. It is especially useful for digital humanities projects, historical research, and automated processing of historical archives in German, French, English, Finnish, and Swedish.
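
For the NER use case, a typical starting point is to load the checkpoint with the Hugging Face Transformers library and fine-tune a token classification head on labeled data. The sketch below rests on two assumptions: that the checkpoint is published on the Hugging Face Hub under the ID `dbmdz/bert-base-historic-multilingual-cased` (inferred from the author and model name above), and that it ships without a trained NER head, so the classification layer starts randomly initialized. The helper name `load_for_ner` is hypothetical.

```python
# Assumed Hub ID, inferred from the author and model name in this card.
MODEL_ID = "dbmdz/bert-base-historic-multilingual-cased"


def load_for_ner(num_labels: int):
    """Load tokenizer and model with a fresh (untrained) token classification head.

    The import is kept inside the function so this module can be imported
    without the transformers package installed.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_ner(num_labels=9)`, for example, would prepare a model for a nine-tag labeling scheme. The predictions are not meaningful until the head has been fine-tuned on an annotated historical NER dataset.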