Pythia-160M
| Property | Value |
|---|---|
| Parameter Count | 162M (85M non-embedding) |
| Model Type | Transformer-based Language Model |
| Architecture | 12 layers, 768 model dimension, 12 attention heads |
| License | Apache 2.0 |
| Paper | Link |
What is Pythia-160M?
Pythia-160M is part of EleutherAI's Pythia Scaling Suite, a collection of models specifically developed to facilitate interpretability research in language models. This particular model contains 162M parameters and was trained on The Pile dataset, making it comparable to models like GPT-Neo 125M and OPT-125M in size and architecture.
Implementation Details
The model employs a transformer-based architecture with 12 layers, a model dimension of 768, and 12 attention heads. It was trained with a batch size of 2M tokens and a learning rate of 6.0x10^-4, using the GPT-NeoX framework.
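As a quick sanity check on these figures, here is a minimal sketch (assuming the `transformers` library and the public `EleutherAI/pythia-160m` repository on the Hugging Face Hub) that reads the architecture hyperparameters from the published configuration:

```python
from transformers import AutoConfig

# Load the published configuration for Pythia-160M (GPT-NeoX architecture).
config = AutoConfig.from_pretrained("EleutherAI/pythia-160m")

# These fields should match the figures quoted above:
# 12 layers, model dimension 768, 12 attention heads.
print(config.num_hidden_layers)    # transformer layers
print(config.hidden_size)          # model dimension
print(config.num_attention_heads)  # attention heads
```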
- Training utilized The Pile dataset without deduplication
- Includes 154 checkpoints saved throughout training (see the loading sketch after this list)
- Implements Flash Attention for improved performance
- Uses the same tokenizer as GPT-NeoX-20B
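The intermediate checkpoints are published as branches of the model repository. A minimal sketch for loading one of them, assuming branch names follow EleutherAI's `step<N>` convention (e.g. `step3000`):

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Each training checkpoint lives on its own branch of the repo; the
# `revision` argument selects which training step to load.
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step3000",                      # training step to study
    cache_dir="./pythia-160m/step3000",       # optional local cache path
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step3000",
)
```

Comparing the same prompt or probe across several revisions is the typical way the checkpoint suite is used to study training dynamics.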
Core Capabilities
- English language text generation
- Research-focused model design
- Supports scientific experimentation on language model behavior
- Compatible with the Hugging Face Transformers library (basic usage sketched below)
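As a rough illustration of the text-generation path, here is a sketch assuming the `transformers` and `torch` packages; the prompt and decoding settings are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
model.eval()

prompt = "The Pythia suite was designed to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Greedy decoding of a short continuation; at 160M parameters the
    # output is intended for analysis rather than polished generation.
    output_ids = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```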
Frequently Asked Questions
Q: What makes this model unique?
This model is part of a carefully controlled experimental suite in which all models are trained on identical data in the same order, making it well suited to studies of training dynamics and interpretability research.
Q: What are the recommended use cases?
Pythia-160M is primarily intended for research purposes, particularly in studying model behavior and interpretability. It is not recommended for deployment in production environments or direct human-facing applications without fine-tuning and proper risk assessment.