OCRonos-Vintage
| Property | Value |
|---|---|
| Parameter Count | 124M |
| Model Type | GPT-2 |
| License | Apache 2.0 |
| Training Data | 18B tokens from cultural heritage archives |
| Context Window | 1,024 tokens |
What is OCRonos-Vintage?
OCRonos-Vintage is a specialized language model designed for OCR correction of historical texts, particularly those from cultural heritage archives. Pre-trained from scratch on 18 billion tokens from sources such as the Library of Congress, the Internet Archive, and HathiTrust, the model focuses on content published before December 29th, 1955, with the majority dating from between 1880 and 1920.
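The sketch below shows one way to run a correction with the `transformers` library. Both the repository id `PleIAs/OCRonos-Vintage` and the `### Text ### / ### Correction ###` prompt delimiters are assumptions here; verify them against the official model card before relying on them.

```python
# A minimal correction sketch. The repository id and the
# "### Text ### / ### Correction ###" prompt delimiters are assumptions;
# check both against the official model card.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "PleIAs/OCRonos-Vintage"  # assumed Hugging Face repository id
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

ocr_text = "Tlie wea.ther to-day was fa1r and plea-sant."  # noisy OCR input
prompt = f"### Text ###\n{ocr_text}\n\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding keeps the correction deterministic
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens and print only the generated correction.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```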
Implementation Details
The model was trained with llm.c on four H100 GPUs in about two and a half hours, completing 9,060 steps over 2 epochs. It uses the standard GPT-2 tokenizer and runs efficiently on both CPU and GPU, reaching inference speeds of over 10,000 tokens per second. Key details (see the chunking sketch after this list):
- Lightweight architecture at 124M parameters
- BF16 tensor type for efficient processing
- 1,024 token context window (see the chunking sketch below)
- Trained on the Jean Zay H100 cluster
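Because a correction prompt must fit both the noisy input and the corrected output inside the 1,024-token window, longer pages need to be split before correction. The sketch below is one way to pre-chunk text with the GPT-2 tokenizer; the 256-token chunk size is an illustrative choice, not a documented recommendation.

```python
# One way to pre-chunk long OCR output so that prompt plus correction fits
# inside the 1,024-token window. The 256-token chunk size is an illustrative
# choice, not a documented recommendation.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # same tokenizer family as the model

def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens GPT-2 tokens."""
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

# Each chunk is then wrapped in its own correction prompt and processed independently.
```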
Core Capabilities
- High-quality OCR correction for historical English texts
- Performance comparable to GPT-4 for cultural archive correction
- Historical text generation with period-appropriate content
- Efficient processing on both CPU and GPU
Frequently Asked Questions
Q: What makes this model unique?
OCRonos-Vintage is unique in its specialized focus on historical text correction, being trained exclusively on pre-1955 content from cultural heritage archives. It is also fully open: the code, weights, and training data are all released under permissive licenses.
Q: What are the recommended use cases?
The model excels at correcting OCR errors in English-language texts from the mid-19th to mid-20th centuries. It can also generate period-appropriate historical text, though it may struggle with modern concepts because of the temporal cutoff of its training data.
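As an illustration of the generation use case, the sketch below samples a continuation from a period-style seed; the repository id and decoding settings are assumptions rather than tuned recommendations.

```python
# Illustrative generation sketch; the repository id and decoding settings
# are assumptions, not tuned recommendations.
from transformers import pipeline

generator = pipeline("text-generation", model="PleIAs/OCRonos-Vintage")
seed = "On the evening of the 12th of March, 1897, the town of"
result = generator(seed, max_new_tokens=60, do_sample=True, temperature=0.9)
print(result[0]["generated_text"])
```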