# Instella-3B-Instruct
| Property | Value |
|---|---|
| Parameter Count | 3.11B |
| Context Length | 4,096 tokens |
| Architecture | 36 decoder layers, 32 attention heads |
| License | ResearchRAIL |
| Training Tokens | 4.15 trillion |
## What is Instella-3B-Instruct?
Instella-3B-Instruct is AMD's latest instruction-tuned language model, developed as part of their commitment to open-source AI research. Trained on AMD Instinct MI300X GPUs, this model represents a significant advancement in fully open language models, achieving performance that rivals closed-source competitors while maintaining complete transparency in its development process.
## Implementation Details
The model uses a transformer-based architecture with 36 decoder layers and 32 attention heads. Training employed efficiency techniques including FlashAttention-2, Torch Compile, and Fully Sharded Data Parallelism (FSDP) with hybrid sharding. The training process ran in multiple stages: first-stage pre-training (4.065T tokens), second-stage pre-training (57.575B tokens), supervised fine-tuning (SFT), and direct preference optimization (DPO).
- Vocabulary of ~50,000 tokens using the OLMo tokenizer
- 4,096-token context length
- bfloat16 mixed-precision training
- Trained across 128 Instinct MI300X GPUs
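The DPO stage mentioned above optimizes the policy directly on preference pairs, without a separate reward model. The following is a minimal, self-contained sketch of the DPO loss for a single preference pair; the inputs (summed log-probabilities of each response) and the `beta` value are illustrative, not taken from Instella's training recipe.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response by a wide margin over the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy favors the chosen response more
# strongly (relative to the reference) than the rejected one.
strong_pref = dpo_loss(-10.0, -40.0, -20.0, -30.0)
weak_pref = dpo_loss(-25.0, -26.0, -20.0, -30.0)
```

In real training this loss is averaged over a batch of preference pairs and backpropagated through the policy's log-probabilities only, with the reference model held fixed.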
## Core Capabilities
- Outperforms existing fully open models by 14.37% on average
- Competitive performance with Llama-3.2-3B and Qwen2.5-3B
- Strong performance in instruction following and multi-turn QA tasks
- Enhanced capabilities in mathematical reasoning and knowledge recall
## Frequently Asked Questions
**Q: What makes this model unique?**
Instella-3B-Instruct stands out for being fully open-source while achieving performance comparable to closed-source models. It's trained using AMD's MI300X GPUs and implements state-of-the-art training techniques, making it a significant milestone in open AI development.
**Q: What are the recommended use cases?**
The model excels in instruction following, multi-turn QA tasks, and mathematical reasoning. However, it's intended for research purposes only and should not be used in safety-critical situations, health applications, or scenarios requiring high factual accuracy.
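For multi-turn QA, prompts are typically assembled from alternating role-tagged turns. The sketch below illustrates that assembly with a hypothetical tag format; the model's actual chat template ships with its tokenizer (e.g., `apply_chat_template` in Hugging Face transformers) and should be used instead of hand-rolled tags in practice.

```python
def build_chat_prompt(turns, system=None):
    """Assemble a multi-turn prompt from (role, text) pairs.

    NOTE: the <|role|> tag format here is purely illustrative and is
    NOT Instella's real chat template, which is defined by the
    tokenizer that ships with the model.
    """
    parts = []
    if system:
        parts.append(f"<|system|>\n{system}")
    for role, text in turns:
        parts.append(f"<|{role}|>\n{text}")
    # Trailing assistant tag cues the model to generate its reply.
    parts.append("<|assistant|>\n")
    return "\n".join(parts)

prompt = build_chat_prompt(
    [("user", "What is 12 * 9?"),
     ("assistant", "12 * 9 = 108."),
     ("user", "And divided by 4?")],
    system="You are a helpful assistant.",
)
```

Keeping the full turn history in the prompt is what gives the model the context needed for follow-up questions like the last one above.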