Instella-3B-Stage1
| Property | Value |
|---|---|
| Parameter Count | 3.11B |
| Training Tokens | 4.065T |
| License | ResearchRAIL |
| Architecture | 36 layers, 32 attention heads |
| Context Length | 4096 tokens |
| Model Type | Causal Language Model |
What is Instella-3B-Stage1?
Instella-3B-Stage1 is the first-stage pre-trained checkpoint in AMD's Instella series of fully open language models. Trained on AMD Instinct™ MI300X GPUs, it represents the initial phase of a multi-stage training pipeline and establishes the model's foundation in natural language understanding.
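For orientation, here is a minimal usage sketch. It assumes the checkpoint is published on Hugging Face under the repository id amd/Instella-3B-Stage1 and loads through the standard transformers causal-LM API; adjust the id and loading flags to match the actual release.

```python
# Minimal usage sketch. The repository id and trust_remote_code flag are
# assumptions about how the checkpoint is published, not verified details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Instella-3B-Stage1"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",
    trust_remote_code=True,      # in case the release ships custom modeling code
)

# Stage-1 checkpoints are base models, so prompt with plain text completion.
prompt = "Large language models are trained by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a stage-1 base model rather than an instruction-tuned one, plain completion prompts like the one above work better than chat-style instructions.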
Implementation Details
The model is a decoder-only transformer with 36 decoder layers and 32 attention heads, a hidden size of 2560, and an MLP hidden size of 13824. Training uses FlashAttention-2, Torch Compile, and Fully Sharded Data Parallelism (FSDP) with hybrid sharding.
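To keep those dimensions in one place, the sketch below gathers them into a plain dataclass; the field names are illustrative and do not mirror the model's actual configuration keys.

```python
# Illustrative only: the published dimensions collected into a plain dataclass.
# Field names are hypothetical and do not mirror the real configuration keys.
from dataclasses import dataclass

@dataclass
class InstellaStage1Shape:
    num_layers: int = 36            # decoder layers
    num_attention_heads: int = 32
    hidden_size: int = 2560
    mlp_hidden_size: int = 13824
    max_context_length: int = 4096  # tokens

shape = InstellaStage1Shape()
print("per-head dimension:", shape.hidden_size // shape.num_attention_heads)  # 80
```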
- Trained with the AdamW optimizer at a peak learning rate of 4.0e-4 (see the sketch after this list)
- Uses a cosine learning rate schedule with warmup
- Trained in bfloat16 mixed precision
- Supports a context length of up to 4,096 tokens
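The optimizer recipe above can be expressed in a few lines of PyTorch. This is a minimal sketch rather than the actual training code: the warmup and total step counts are placeholders, since the card does not state them.

```python
# Minimal sketch of the stated recipe: AdamW at a 4.0e-4 peak learning rate,
# linear warmup followed by cosine decay, and bfloat16 autocast. The warmup
# and total step counts are placeholders; the card does not specify them.
import math
import torch

model = torch.nn.Linear(2560, 2560)  # stand-in for the actual 3B model
optimizer = torch.optim.AdamW(model.parameters(), lr=4.0e-4)

warmup_steps, total_steps = 2_000, 100_000  # hypothetical values

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative training step under bfloat16 autocast (CPU for portability).
batch = torch.randn(4, 2560)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```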
Core Capabilities
- Outperforms existing fully open models across multiple benchmarks
- Achieves a 61.33% average score on standard benchmarks
- Excels in ARC Challenge (53.85%) and ARC Easy (73.16%)
- Strong performance in knowledge-intensive tasks
Frequently Asked Questions
Q: What makes this model unique?
Instella-3B-Stage1 combines competitive performance with a fully open-source release: it was trained on AMD Instinct MI300X GPUs and achieves results comparable to similar models while using significantly fewer training tokens.
Q: What are the recommended use cases?
The model is designed for research purposes and excels in tasks requiring natural language understanding, including question answering, reasoning, and knowledge-intensive applications. However, it's not recommended for safety-critical or medical applications.