Cerebras-GPT-13B
| Property | Value |
|---|---|
| Parameter Count | 13 Billion |
| License | Apache 2.0 |
| Paper | arXiv Paper |
| Training Data | The Pile |
| Context Length | 2048 tokens |
What is Cerebras-GPT-13B?
Cerebras-GPT-13B is a large language model developed by Cerebras Systems and is the largest member of the Cerebras-GPT family. It was trained according to Chinchilla scaling laws, using roughly 20 training tokens per parameter, on the Andromeda AI supercomputer built from 16 CS-2 wafer-scale systems.
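As a rough check on that token budget, multiplying the nominal 13B parameter count by 20 tokens per parameter lands close to the ~2.57E+11 training tokens reported below; the small gap presumably reflects the exact parameter count sitting slightly under the nominal 13B.

```python
# Back-of-the-envelope check of the Chinchilla-style budget:
# ~20 training tokens per parameter for a nominally 13B-parameter model.
params = 13e9
tokens = 20 * params
print(f"{tokens:.2e} training tokens")  # ~2.60e+11, in line with the reported 2.57e+11
```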
Implementation Details
The model uses a GPT-3-style architecture with 40 layers, a hidden dimension of 5120, and 40 attention heads. It employs dense (full) attention rather than the sparse banded attention used in GPT-3, along with learned positional embeddings. Training used the AdamW optimizer with carefully tuned hyperparameters over approximately 2.57E+11 tokens. Key hyperparameters are listed below, followed by a configuration sketch.
- Vocabulary Size: 50257 tokens
- Training Batch Size: 720-1080 sequences
- Learning Rate: 1.2E-04
- Feed-forward Dimension: 20480
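The dimensions above can be expressed as a GPT-2-style configuration (the Hugging Face checkpoints for Cerebras-GPT are published under the GPT-2 architecture). The sketch below simply mirrors the listed values and builds a randomly initialized skeleton, not the released weights.

```python
# Sketch of the architecture dimensions from the list above, as a GPT-2-style config.
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=50257,   # Vocabulary size
    n_positions=2048,   # Context length
    n_embd=5120,        # Hidden dimension
    n_layer=40,         # Transformer layers
    n_head=40,          # Attention heads
    n_inner=20480,      # Feed-forward dimension (4 x hidden)
)
print(config)
```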
Core Capabilities
- Zero-shot and few-shot task performance
- Solid benchmark results for its size (e.g., 0.766 accuracy on PIQA, 0.696 on LAMBADA)
- Text generation and completion (see the generation sketch after this list)
- Feature extraction for downstream tasks
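A minimal text-generation sketch using the public `cerebras/Cerebras-GPT-13B` checkpoint is shown below. Loading 13B parameters in float16 still needs on the order of 26 GB of accelerator memory, so `device_map="auto"` (which requires the `accelerate` package) is used here purely as an illustration; the prompt and sampling settings are arbitrary.

```python
# Minimal generation sketch with the Hugging Face checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Generative AI is "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; the model has a 2048-token context window.
outputs = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```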
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its adherence to Chinchilla compute-optimal scaling (about 20 tokens per parameter) and its training on the Andromeda AI supercomputer. This token-to-parameter ratio lets it reach good benchmark accuracy for its training compute budget.
Q: What are the recommended use cases?
The model is primarily intended for research in NLP, ethics, and alignment. While it can be fine-tuned for specific applications (a minimal sketch follows), it is not recommended for direct production deployment without additional safety measures and task-specific tuning.
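For task-specific tuning, a bare-bones causal-LM fine-tuning loop with the Hugging Face `Trainer` might look like the sketch below. The dataset file, batch size, and learning rate are illustrative assumptions rather than recommendations, and a model of this size generally calls for multi-GPU setups or parameter-efficient methods such as LoRA in practice.

```python
# Illustrative fine-tuning sketch; hyperparameters and data are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_id = "cerebras/Cerebras-GPT-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical plain-text corpus; replace with your task-specific data.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cerebras-gpt-13b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```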