OCRonos-Vintage
| Property | Value |
|---|---|
| Parameter Count | 124M |
| Model Type | GPT-2 |
| License | Apache 2.0 |
| Training Data | 18B tokens from cultural heritage archives |
| Context Window | 1,024 tokens |
What is OCRonos-Vintage?
OCRonos-Vintage is a specialized language model designed for OCR correction of historical texts, particularly those from cultural heritage archives. Pre-trained from scratch on 18 billion tokens from sources such as the Library of Congress, the Internet Archive, and HathiTrust, the model focuses on content published before December 29th, 1955, with the majority dating from between 1880 and 1920.
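The sketch below shows one way to run a correction with the `transformers` library. Both the repository id `PleIAs/OCRonos-Vintage` and the `### Text ### / ### Correction ###` prompt delimiters are assumptions here; verify them against the official model card before relying on them.

```python
# A minimal correction sketch. The repository id and the
# "### Text ### / ### Correction ###" prompt delimiters are assumptions;
# check both against the official model card.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "PleIAs/OCRonos-Vintage"  # assumed Hugging Face repository id
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

ocr_text = "Tlie wea.ther to-day was fa1r and plea-sant."  # noisy OCR input
prompt = f"### Text ###\n{ocr_text}\n\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding keeps the correction deterministic
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens and print only the generated correction.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```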
Implementation Details
The model was trained with llm.c on four H100 GPUs in about two and a half hours, completing 9,060 steps over 2 epochs. It uses the standard GPT-2 tokenizer and runs efficiently on both CPU and GPU, reaching inference speeds of over 10,000 tokens per second. Key details (see the chunking sketch after this list):
- Lightweight architecture at 124M parameters
- BF16 tensor type for efficient processing
- 1,024 token context window (see the chunking sketch below)
- Trained on the Jean Zay H100 cluster
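Because a correction prompt must fit both the noisy input and the corrected output inside the 1,024-token window, longer pages need to be split before correction. The sketch below is one way to pre-chunk text with the GPT-2 tokenizer; the 256-token chunk size is an illustrative choice, not a documented recommendation.

```python
# One way to pre-chunk long OCR output so that prompt plus correction fits
# inside the 1,024-token window. The 256-token chunk size is an illustrative
# choice, not a documented recommendation.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # same tokenizer family as the model

def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens GPT-2 tokens."""
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

# Each chunk is then wrapped in its own correction prompt and processed independently.
```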
Core Capabilities
- High-quality OCR correction for historical English texts
- Performance comparable to GPT-4 for cultural archive correction
- Historical text generation with period-appropriate content
- Efficient processing on both CPU and GPU
Frequently Asked Questions
Q: What makes this model unique?
OCRonos-Vintage is unique in its specialized focus on historical text correction, being trained exclusively on pre-1955 content from cultural heritage archives. It is also fully open: the code, weights, and training data are all released under permissive licenses.
Q: What are the recommended use cases?
The model excels at correcting OCR errors in English-language texts from the mid-19th to mid-20th centuries. It can also generate period-appropriate historical text, though it may struggle with modern concepts because of the temporal cutoff of its training data.
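As an illustration of the generation use case, the sketch below samples a continuation from a period-style seed; the repository id and decoding settings are assumptions rather than tuned recommendations.

```python
# Illustrative generation sketch; the repository id and decoding settings
# are assumptions, not tuned recommendations.
from transformers import pipeline

generator = pipeline("text-generation", model="PleIAs/OCRonos-Vintage")
seed = "On the evening of the 12th of March, 1897, the town of"
result = generator(seed, max_new_tokens=60, do_sample=True, temperature=0.9)
print(result[0]["generated_text"])
```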