OCRonos-Vintage

Maintained By
PleIAs


| Property | Value |
|---|---|
| Parameter Count | 124M |
| Model Type | GPT-2 |
| License | Apache 2.0 |
| Training Data | 18B tokens from cultural heritage archives |
| Context Window | 1,024 tokens |

What is OCRonos-Vintage?

OCRonos-Vintage is a specialized language model designed for OCR correction of historical texts, particularly those from cultural heritage archives. Pre-trained from scratch on 18 billion tokens from sources like the Library of Congress, Internet Archive, and Hathi Trust, this model focuses on content published before December 29th, 1955, with the majority dating between 1880 and 1920.

Implementation Details

The model was trained with llm.c on four H100 GPUs in two and a half hours, completing 9,060 steps over two epochs. It uses the GPT-2 tokenizer and runs efficiently on both CPU and GPU, reaching speeds of over 10,000 tokens per second.

  • Lightweight architecture at 124M parameters
  • BF16 tensor type for efficient processing
  • 1,024 token context window
  • Trained on the Jean Zay H100 cluster
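As a minimal sketch, the model can be queried through the Hugging Face transformers library. Note the assumptions here: the repo id `PleIAs/OCRonos-Vintage` and the `### Text ###` / `### Correction ###` prompt delimiters are taken from the model's published usage pattern and should be checked against the actual model card; the helper functions are illustrative names, not part of any API.

```python
# Hedged sketch of OCR correction with OCRonos-Vintage via transformers.
# Assumptions: repo id "PleIAs/OCRonos-Vintage" and the "### Text ###" /
# "### Correction ###" prompt delimiters; verify both against the model card.

def build_correction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in the assumed correction prompt format."""
    return f"### Text ###\n{ocr_text}\n\n### Correction ###\n"

def extract_correction(generated: str) -> str:
    """Pull the corrected text out of the model's full decoded output."""
    return generated.split("### Correction ###", 1)[-1].strip()

RUN_MODEL = False  # flip to True to actually download and run the 124M model

if RUN_MODEL:
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("PleIAs/OCRonos-Vintage")
    model = GPT2LMHeadModel.from_pretrained("PleIAs/OCRonos-Vintage")

    prompt = build_correction_prompt("Tbe zospel of wealth, by Andrevv Carnegie")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=256,  # keep prompt + output inside the 1,024-token window
        do_sample=False,     # greedy decoding: correction should be deterministic
    )
    print(extract_correction(tokenizer.decode(output[0])))
```

Greedy decoding is used because correction is a reconstruction task rather than open-ended generation; sampling would only add variance to the output.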

Core Capabilities

  • High-quality OCR correction for historical English texts
  • Performance comparable to GPT-4 for cultural archive correction
  • Historical text generation with period-appropriate content
  • Efficient processing on both CPU and GPU

Frequently Asked Questions

Q: What makes this model unique?

OCRonos-Vintage is unique in its specialized focus on historical text correction, being trained exclusively on pre-1955 content from cultural heritage archives. It is also fully open: the code, weights, and training data are all released under a permissive license.

Q: What are the recommended use cases?

The model excels at correcting OCR errors in English-language texts from the mid-19th to mid-20th centuries. It can also be used for generating period-appropriate historical text, though it may struggle with modern concepts due to its temporal training constraints.
