Pythia-160M
| Property | Value |
|---|---|
| Parameter Count | 162M (85M non-embedding) |
| Model Type | Transformer-based Language Model |
| Architecture | 12 layers, 768 model dimension, 12 attention heads |
| License | Apache 2.0 |
| Paper | Link |
What is Pythia-160M?
Pythia-160M is part of EleutherAI's Pythia Scaling Suite, a collection of models specifically developed to facilitate interpretability research in language models. This particular model contains 162M parameters and was trained on The Pile dataset, making it comparable to models like GPT-Neo 125M and OPT-125M in size and architecture.
Implementation Details
The model employs a transformer-based architecture with 12 layers, a model dimension of 768, and 12 attention heads. It was trained with a batch size of 2M tokens and a learning rate of 6.0x10^-4, using the GPT-NeoX framework.
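As a quick sanity check on these figures, here is a minimal sketch (assuming the `transformers` library and the public `EleutherAI/pythia-160m` repository on the Hugging Face Hub) that reads the architecture hyperparameters from the published configuration:

```python
from transformers import AutoConfig

# Load the published configuration for Pythia-160M (GPT-NeoX architecture).
config = AutoConfig.from_pretrained("EleutherAI/pythia-160m")

# These fields should match the figures quoted above:
# 12 layers, model dimension 768, 12 attention heads.
print(config.num_hidden_layers)    # transformer layers
print(config.hidden_size)          # model dimension
print(config.num_attention_heads)  # attention heads
```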
- Training utilized The Pile dataset without deduplication
- Includes 154 checkpoints saved throughout training (see the loading sketch after this list)
- Implements Flash Attention for improved performance
- Uses the same tokenizer as GPT-NeoX-20B
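The intermediate checkpoints are published as branches of the model repository. A minimal sketch for loading one of them, assuming branch names follow EleutherAI's `step<N>` convention (e.g. `step3000`):

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Each training checkpoint lives on its own branch of the repo; the
# `revision` argument selects which training step to load.
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step3000",                      # training step to study
    cache_dir="./pythia-160m/step3000",       # optional local cache path
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step3000",
)
```

Comparing the same prompt or probe across several revisions is the typical way the checkpoint suite is used to study training dynamics.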
Core Capabilities
- English language text generation
- Research-focused model design
- Supports scientific experimentation on language model behavior
- Compatible with the Hugging Face Transformers library (basic usage sketched below)
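As a rough illustration of the text-generation path, here is a sketch assuming the `transformers` and `torch` packages; the prompt and decoding settings are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
model.eval()

prompt = "The Pythia suite was designed to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Greedy decoding of a short continuation; at 160M parameters the
    # output is intended for analysis rather than polished generation.
    output_ids = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```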
Frequently Asked Questions
Q: What makes this model unique?
This model is part of a carefully controlled experimental suite in which all models are trained on identical data in the same order, making it well suited to studies of training dynamics and interpretability research.
Q: What are the recommended use cases?
Pythia-160M is primarily intended for research purposes, particularly in studying model behavior and interpretability. It is not recommended for deployment in production environments or direct human-facing applications without fine-tuning and proper risk assessment.