Yuan2-M32-hf
| Property | Value |
|---|---|
| Total Parameters | 40B |
| Active Parameters | 3.7B |
| Sequence Length | 16K |
| Training Tokens | 2000B |
| License | Apache 2.0 |
| Paper | View Paper |
What is Yuan2-M32-hf?
Yuan2-M32-hf is a Mixture-of-Experts (MoE) language model designed for efficiency. With 32 experts of which only 2 are active per token, it delivers strong performance while using just 9.25% of the computation required by a comparable dense model. Expert selection is handled by a novel Attention Router, which improves accuracy by 3.8% over a classical routing network.
Implementation Details
The model processes sequences of up to 16K tokens. Its forward pass costs only 7.4 GFLOPs per token, roughly 1/19th of what Llama3-70B requires.
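As a quick sanity check on those figures, the common approximation of about 2 floating-point operations per active parameter per forward token (an assumption used in this sketch, not a claim from the model card) reproduces both numbers:

```python
# Back-of-envelope check using the common ~2 FLOPs per parameter per token
# approximation for a forward pass (an assumption, not from the model card).
active_params = 3.7e9          # Yuan2-M32 active parameters
llama3_70b_params = 70e9       # dense comparison model

yuan_gflops = 2 * active_params / 1e9        # ~7.4 GFLOPs per token
llama_gflops = 2 * llama3_70b_params / 1e9   # ~140 GFLOPs per token

print(f"Yuan2-M32: ~{yuan_gflops:.1f} GFLOPs/token")
print(f"Llama3-70B: ~{llama_gflops:.0f} GFLOPs/token")
print(f"ratio: ~1/{llama_gflops / yuan_gflops:.0f}")  # ~1/19
```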
- Advanced Attention Router network for expert selection (a simplified sketch follows this list)
- 32 total experts with 2 active experts per forward pass
- Only 3.7B active parameters out of 40B total
- Trained on 2000B tokens from scratch
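To make the routing idea concrete, here is a simplified, illustrative top-2 router in PyTorch. It only sketches the general shape of attention-style expert scoring followed by top-2 selection; the `expert_keys`/`expert_values` parameters, tensor shapes, and scoring details are assumptions for illustration and do not reproduce the exact Attention Router used in Yuan2-M32.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedAttentionRouter(nn.Module):
    """Illustrative top-2 router with an attention-style scoring step.

    A simplified sketch of the idea described above (experts scored
    jointly rather than independently); NOT the exact Yuan2-M32 router.
    """
    def __init__(self, hidden_size: int, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Learnable per-expert embeddings act as keys/values for routing.
        self.expert_keys = nn.Parameter(torch.randn(num_experts, hidden_size) * 0.02)
        self.expert_values = nn.Parameter(torch.randn(num_experts, num_experts) * 0.02)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_size)
        # Attention-style scores: each token attends over the expert keys,
        # then mixes per-expert value vectors, so an expert's final logit
        # can depend on its relation to the other experts.
        attn = F.softmax(hidden_states @ self.expert_keys.t(), dim=-1)  # (tokens, experts)
        logits = attn @ self.expert_values                              # (tokens, experts)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)      # pick 2 of 32 experts
        weights = F.softmax(top_vals, dim=-1)                           # renormalise over the chosen 2
        return top_idx, weights


# Toy usage: route 4 tokens of width 2048 to 2 of 32 experts.
router = SimplifiedAttentionRouter(hidden_size=2048)
idx, w = router(torch.randn(4, 2048))
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```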
Core Capabilities
- Outperforms Llama3-70B on MATH (55.9%) and ARC-Challenge (95.8%)
- Strong performance in coding (74.4% on HumanEval)
- Exceptional mathematical reasoning (92.7% on GSM8K)
- Robust general knowledge (72.2% on MMLU)
Frequently Asked Questions
Q: What makes this model unique?
The model's key innovation lies in its Attention Router and efficient MoE architecture, achieving state-of-the-art performance with significantly fewer active parameters and computational requirements than comparable models.
Q: What are the recommended use cases?
The model excels in coding tasks, mathematical reasoning, and complex problem-solving scenarios. It's particularly well-suited for applications requiring high performance with limited computational resources.
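For reference, a minimal text-generation sketch with Hugging Face Transformers might look like the following. The repository id `IEITYuan/Yuan2-M32-hf`, the bfloat16 dtype, and the `trust_remote_code` usage are assumptions based on how Yuan2 checkpoints are typically published; verify them against the official model card before use.

```python
# Minimal generation sketch; repo id, dtype, and trust_remote_code usage are
# assumptions -- verify them against the official Yuan2-M32-hf model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IEITYuan/Yuan2-M32-hf"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # keeps the 40B weights within GPU memory budgets
    device_map="auto",
    trust_remote_code=True,       # Yuan2 checkpoints ship custom modeling code
)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```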