OLMo-1B-0724-hf
| Property | Value |
|---|---|
| Parameter Count | 1.28B |
| Training Tokens | 3.05 Trillion |
| License | Apache 2.0 |
| Paper | arxiv:2402.00838 |
What is OLMo-1B-0724-hf?
OLMo-1B-0724-hf is the July 2024 release of Allen AI's Open Language Model (OLMo) series and a significant improvement over its predecessor. This 1.28B-parameter model is trained on an enhanced version of the Dolma dataset with better deduplication and quality filtering, and it shows clear performance gains, including a 4.4-point increase on the HellaSwag benchmark.
Implementation Details
The model uses a 16-layer transformer with a hidden size of 2048 and 16 attention heads, and supports a context length of 4096 tokens (see the loading sketch after the list below). Training proceeds in two stages: an initial phase on the Dolma 1.7 dataset followed by a second phase on a higher-quality subset of the data.
- Advanced staged training approach with cosine learning rate scheduling
- Optimized with AdamW optimizer (learning rate: 4.0E-4)
- Implements full attention mechanism with non-parametric LayerNorm
- Supports efficient quantization for improved inference speed
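As a rough illustration of the configuration above, the sketch below loads the checkpoint with Hugging Face Transformers and prints the architecture fields; the commented 8-bit path is one possible quantized setup and assumes the bitsandbytes and accelerate packages plus a CUDA device.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-1B-0724-hf",
    torch_dtype=torch.float32,  # use torch.bfloat16 to roughly halve memory
)

# Inspect the architecture values quoted above.
cfg = model.config
print(cfg.num_hidden_layers)        # 16 layers
print(cfg.hidden_size)              # 2048
print(cfg.num_attention_heads)      # 16
print(cfg.max_position_embeddings)  # 4096-token context

# Optional: 8-bit quantized loading (assumes bitsandbytes, accelerate, and a GPU).
# from transformers import BitsAndBytesConfig
# model_8bit = AutoModelForCausalLM.from_pretrained(
#     "allenai/OLMo-1B-0724-hf",
#     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#     device_map="auto",
# )
```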
Core Capabilities
- Strong performance in multiple benchmarks (65.0 average score across standard tasks)
- Excels on tasks such as SciQ (93.4%) and PIQA (74.9%)
- Efficient text generation with support for standard sampling parameters (see the generation sketch after this list)
- Native integration with HuggingFace Transformers library
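A minimal generation sketch with the Transformers API is shown below; the prompt and sampling values (top_k, top_p, temperature, max_new_tokens) are illustrative choices rather than values prescribed by the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-0724-hf")

# Illustrative prompt; any text works.
inputs = tokenizer("Language models are", return_tensors="pt")

# Sampled decoding with illustrative parameter values.
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Greedy decoding (do_sample=False) is also available when deterministic output is preferred.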
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its open-science approach, with full transparency about its training data and process, and for the clear gains delivered by its staged training recipe. It achieves competitive performance despite its relatively small size compared to larger models.
Q: What are the recommended use cases?
The model is well suited to general language modeling, research applications, and fine-tuning for specific downstream tasks. It is particularly attractive where strong language understanding and generation are needed at modest computational cost.
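As a rough starting point for the fine-tuning use case, the sketch below adapts the checkpoint to a causal-LM corpus with the Transformers Trainer; the WikiText-2 dataset, the padding fallback, and all hyperparameters are illustrative assumptions rather than settings from the OLMo paper or model card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "allenai/OLMo-1B-0724-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Ensure a pad token exists for batching (assumption: reuse EOS if unset).
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative corpus; swap in your own downstream-task data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

# Causal-LM objective: labels are the input ids, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="olmo-1b-finetuned",   # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```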