Instella-3B-Stage1
| Property | Value |
|---|---|
| Parameter Count | 3.11B |
| Training Tokens | 4.065T |
| License | ResearchRAIL |
| Architecture | 36 layers, 32 attention heads |
| Context Length | 4096 tokens |
| Model Type | Causal Language Model |
What is Instella-3B-Stage1?
Instella-3B-Stage1 is the first-stage pre-trained checkpoint in AMD's Instella series of fully open language models. Trained on AMD Instinct™ MI300X GPUs, it represents the initial phase of a multi-stage training pipeline and establishes the model's foundation in natural language understanding.
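For orientation, here is a minimal usage sketch. It assumes the checkpoint is published on Hugging Face under the repository id amd/Instella-3B-Stage1 and loads through the standard transformers causal-LM API; adjust the id and loading flags to match the actual release.

```python
# Minimal usage sketch. The repository id and trust_remote_code flag are
# assumptions about how the checkpoint is published, not verified details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Instella-3B-Stage1"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",
    trust_remote_code=True,      # in case the release ships custom modeling code
)

# Stage-1 checkpoints are base models, so prompt with plain text completion.
prompt = "Large language models are trained by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a stage-1 base model rather than an instruction-tuned one, plain completion prompts like the one above work better than chat-style instructions.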
Implementation Details
The model is a decoder-only transformer with 36 decoder layers and 32 attention heads, a hidden size of 2560, and an MLP hidden size of 13824. Training uses FlashAttention-2, Torch Compile, and Fully Sharded Data Parallelism (FSDP) with hybrid sharding.
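To keep those dimensions in one place, the sketch below gathers them into a plain dataclass; the field names are illustrative and do not mirror the model's actual configuration keys.

```python
# Illustrative only: the published dimensions collected into a plain dataclass.
# Field names are hypothetical and do not mirror the real configuration keys.
from dataclasses import dataclass

@dataclass
class InstellaStage1Shape:
    num_layers: int = 36            # decoder layers
    num_attention_heads: int = 32
    hidden_size: int = 2560
    mlp_hidden_size: int = 13824
    max_context_length: int = 4096  # tokens

shape = InstellaStage1Shape()
print("per-head dimension:", shape.hidden_size // shape.num_attention_heads)  # 80
```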
- Trained with the AdamW optimizer at a peak learning rate of 4.0e-4 (see the sketch after this list)
- Uses a cosine learning rate schedule with warmup
- Trained in bfloat16 mixed precision
- Supports a context length of up to 4,096 tokens
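The optimizer recipe above can be expressed in a few lines of PyTorch. This is a minimal sketch rather than the actual training code: the warmup and total step counts are placeholders, since the card does not state them.

```python
# Minimal sketch of the stated recipe: AdamW at a 4.0e-4 peak learning rate,
# linear warmup followed by cosine decay, and bfloat16 autocast. The warmup
# and total step counts are placeholders; the card does not specify them.
import math
import torch

model = torch.nn.Linear(2560, 2560)  # stand-in for the actual 3B model
optimizer = torch.optim.AdamW(model.parameters(), lr=4.0e-4)

warmup_steps, total_steps = 2_000, 100_000  # hypothetical values

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative training step under bfloat16 autocast (CPU for portability).
batch = torch.randn(4, 2560)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```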
Core Capabilities
- Outperforms existing fully open models across multiple benchmarks
- Achieves a 61.33% average score on standard benchmarks
- Excels in ARC Challenge (53.85%) and ARC Easy (73.16%)
- Strong performance in knowledge-intensive tasks
Frequently Asked Questions
Q: What makes this model unique?
Instella-3B-Stage1 combines competitive performance with a fully open-source release: it was trained on AMD Instinct MI300X GPUs and achieves results comparable to similar models while using significantly fewer training tokens.
Q: What are the recommended use cases?
The model is designed for research purposes and excels in tasks requiring natural language understanding, including question answering, reasoning, and knowledge-intensive applications. However, it's not recommended for safety-critical or medical applications.