Cerebras-GPT-13B
| Property | Value |
|---|---|
| Parameter Count | 13 Billion |
| License | Apache 2.0 |
| Paper | arXiv Paper |
| Training Data | The Pile |
| Context Length | 2048 tokens |
What is Cerebras-GPT-13B?
Cerebras-GPT-13B is a large language model developed by Cerebras Systems and is the largest member of the Cerebras-GPT family. It was trained according to Chinchilla scaling laws, using roughly 20 training tokens per parameter, on the Andromeda AI supercomputer built from 16 CS-2 wafer-scale systems.
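As a rough check on that token budget, multiplying the nominal 13B parameter count by 20 tokens per parameter lands close to the ~2.57E+11 training tokens reported below; the small gap presumably reflects the exact parameter count sitting slightly under the nominal 13B.

```python
# Back-of-the-envelope check of the Chinchilla-style budget:
# ~20 training tokens per parameter for a nominally 13B-parameter model.
params = 13e9
tokens = 20 * params
print(f"{tokens:.2e} training tokens")  # ~2.60e+11, in line with the reported 2.57e+11
```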
Implementation Details
The model uses a GPT-3-style architecture with 40 layers, a hidden dimension of 5120, and 40 attention heads. It employs dense (full) attention rather than the sparse banded attention used in GPT-3, along with learned positional embeddings. Training used the AdamW optimizer with carefully tuned hyperparameters over approximately 2.57E+11 tokens. Key hyperparameters are listed below, followed by a configuration sketch.
- Vocabulary Size: 50257 tokens
- Training Batch Size: 720-1080 sequences
- Learning Rate: 1.2E-04
- Feed-forward Dimension: 20480
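The dimensions above can be expressed as a GPT-2-style configuration (the Hugging Face checkpoints for Cerebras-GPT are published under the GPT-2 architecture). The sketch below simply mirrors the listed values and builds a randomly initialized skeleton, not the released weights.

```python
# Sketch of the architecture dimensions from the list above, as a GPT-2-style config.
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=50257,   # Vocabulary size
    n_positions=2048,   # Context length
    n_embd=5120,        # Hidden dimension
    n_layer=40,         # Transformer layers
    n_head=40,          # Attention heads
    n_inner=20480,      # Feed-forward dimension (4 x hidden)
)
print(config)
```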
Core Capabilities
- Zero-shot and few-shot task performance
- Solid benchmark results for its size (e.g., 0.766 accuracy on PIQA, 0.696 on LAMBADA)
- Text generation and completion (see the generation sketch after this list)
- Feature extraction for downstream tasks
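A minimal text-generation sketch using the public `cerebras/Cerebras-GPT-13B` checkpoint is shown below. Loading 13B parameters in float16 still needs on the order of 26 GB of accelerator memory, so `device_map="auto"` (which requires the `accelerate` package) is used here purely as an illustration; the prompt and sampling settings are arbitrary.

```python
# Minimal generation sketch with the Hugging Face checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Generative AI is "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; the model has a 2048-token context window.
outputs = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```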
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its adherence to Chinchilla compute-optimal scaling (about 20 tokens per parameter) and its training on the Andromeda AI supercomputer. This token-to-parameter ratio lets it reach good benchmark accuracy for its training compute budget.
Q: What are the recommended use cases?
The model is primarily intended for research in NLP, ethics, and alignment. While it can be fine-tuned for specific applications (a minimal sketch follows), it is not recommended for direct production deployment without additional safety measures and task-specific tuning.
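For task-specific tuning, a bare-bones causal-LM fine-tuning loop with the Hugging Face `Trainer` might look like the sketch below. The dataset file, batch size, and learning rate are illustrative assumptions rather than recommendations, and a model of this size generally calls for multi-GPU setups or parameter-efficient methods such as LoRA in practice.

```python
# Illustrative fine-tuning sketch; hyperparameters and data are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_id = "cerebras/Cerebras-GPT-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical plain-text corpus; replace with your task-specific data.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cerebras-gpt-13b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```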