bert-mini-historic-multilingual-cased

Maintained by: dbmdz

Property         Value
Parameter Count  11.55M
Architecture     BERT Mini (4 layers, 256 hidden)
Training Speed   10.5 sec / 1,000 steps
Languages        German, French, English, Finnish, Swedish
Training Data    ~130GB total across languages

What is bert-mini-historic-multilingual-cased?

This is a compact multilingual BERT model designed specifically for processing historical texts. It is part of the Historic Language Models (HLMs) collection and was trained on a carefully curated dataset of historical documents from Europeana and the British Library. The model is a lighter alternative to full-sized BERT, retaining good performance while requiring significantly fewer computational resources.
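
Loading the model takes a few lines with the Hugging Face transformers library. A minimal sketch, assuming the model is published under the dbmdz namespace on the Hub (consistent with the maintainer shown above):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hub ID assumed from the maintainer name above
model_id = "dbmdz/bert-mini-historic-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```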

Implementation Details

The model features 4 transformer layers with a hidden size of 256, resulting in 11.55M parameters. It was trained on a balanced corpus of historical texts: German (28GB), French (27GB), English (24GB), Finnish (27GB), and Swedish (27GB). The training data was filtered based on OCR confidence scores to ensure quality, with thresholds typically set at 0.6-0.7.
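
The exact filtering pipeline is not published here, but the idea is simple: keep only corpus lines whose OCR confidence clears the threshold. A hedged sketch, assuming a hypothetical TSV layout with "text" and "ocr_confidence" columns:

```python
import csv

# Thresholds of 0.6-0.7 are mentioned above; 0.6 is used here as an example
MIN_CONFIDENCE = 0.6

def filter_corpus(in_path, out_path, threshold=MIN_CONFIDENCE):
    """Keep only lines whose OCR confidence meets the threshold (hypothetical format)."""
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin, delimiter="\t"):
            if float(row["ocr_confidence"]) >= threshold:
                fout.write(row["text"] + "\n")
```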

  • Utilizes a 32k subword vocabulary optimized for historical texts
  • Achieves low subword fertility rates across languages (1.25-1.69 subwords per word; see the sketch after this list)
  • Maintains very low unknown token rates (0.0-0.0007)
  • Training performed on TPU v3 architecture
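
Subword fertility can be checked directly: it is the average number of subword tokens the tokenizer produces per whitespace-separated word, so lower values (closer to 1.0) mean the vocabulary fits the text better. A minimal sketch, assuming the dbmdz Hub ID and an illustrative sentence that is not from the training data:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-mini-historic-multilingual-cased")

def subword_fertility(texts):
    # Average subword tokens per whitespace-separated word (lower is better)
    n_words = sum(len(t.split()) for t in texts)
    n_subwords = sum(len(tokenizer.tokenize(t)) for t in texts)
    return n_subwords / n_words

# Illustrative 19th-century-style German sentence
print(subword_fertility(["Der Kaiser reiste gestern von Berlin nach Wien ab."]))
```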

Core Capabilities

  • Multilingual processing of historical texts from the 19th century
  • Efficient resource utilization with quick inference times
  • Balanced performance across five European languages
  • Specialized in handling historical document peculiarities

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized focus on historical texts while maintaining a compact architecture. It's specifically designed to handle the nuances and variations in 19th-century documents across multiple European languages, making it ideal for digital humanities and historical research applications.

Q: What are the recommended use cases?

The model is particularly suited to processing historical documents, especially those from the 19th century. It is a good fit for named entity recognition in historical texts, document classification, and other NLP tasks where computational efficiency matters and the material comes from historical sources.
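
For a quick sanity check, the model can be exercised through the fill-mask pipeline, since masked language modelling is BERT's pretraining objective. A sketch with an illustrative historical-style French sentence (not drawn from the training corpus):

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-mini-historic-multilingual-cased",
)

# [MASK] is the standard mask token for cased BERT vocabularies
for pred in fill_mask("Le roi de [MASK] arriva à Paris en 1871."):
    print(pred["token_str"], round(pred["score"], 3))
```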
