Moonlight-16B-A3B-Instruct
| Property | Value |
|---|---|
| Total Parameters | 16B |
| Active Parameters | 3B |
| Context Length | 8K tokens |
| Training Tokens | 5.7T |
| Model Type | Mixture-of-Experts (MoE) |
| Paper | arXiv:2502.16982 |
What is Moonlight-16B-A3B-Instruct?
Moonlight-16B-A3B-Instruct is an advanced language model that leverages the innovative Muon optimizer to achieve superior performance with significantly reduced computational requirements. As a Mixture-of-Experts model, it efficiently manages 16B total parameters while only activating 3B during inference, making it both powerful and computationally efficient.
Implementation Details
The model is built on an improved version of the Muon optimizer, featuring two key technical changes: the addition of weight decay and a consistent update RMS across parameters of different shapes. Together, these changes yield approximately 2x the computational efficiency of AdamW-based training.
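To make the two changes concrete, here is a minimal PyTorch sketch of a Muon-style update for a single 2-D weight matrix. The `newton_schulz` helper, the hyperparameter values, and the function names are illustrative assumptions for exposition, not the released training code.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 2e-2, mu: float = 0.95, weight_decay: float = 0.1) -> None:
    """One illustrative Muon update for a 2-D weight matrix (hyperparameters are assumptions)."""
    momentum_buf.mul_(mu).add_(grad)      # momentum accumulation
    ortho = newton_schulz(momentum_buf)   # orthogonalized update direction
    # Scale so the update RMS stays roughly consistent (~0.2, similar to AdamW)
    # regardless of the matrix shape, letting one learning rate serve all parameters.
    rms_scale = 0.2 * max(param.size(0), param.size(1)) ** 0.5
    # Decoupled (AdamW-style) weight decay plus the scaled orthogonal update.
    param.mul_(1.0 - lr * weight_decay)
    param.add_(ortho, alpha=-lr * rms_scale)
```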
- Utilizes the same architecture as DeepSeek-V3
- Supports popular inference engines like vLLM and SGLang
- Implements ZeRO-1 style optimization for distributed training
- Features 8K token context length
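For quick experimentation, a minimal Hugging Face Transformers chat example might look like the following. The repository id and generation settings are assumptions; consult the official model card for the exact recommended usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "moonshotai/Moonlight-16B-A3B-Instruct"  # assumed Hugging Face repo id
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain mixture-of-experts models in one paragraph."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```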
Core Capabilities
- Achieves 70.0 on MMLU (English)
- Scores 77.2 on C-Eval and 78.2 on CMMLU (Chinese)
- Excels in code generation with 48.1 on HumanEval
- Strong mathematical reasoning with 77.4 on GSM8K
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its use of the Muon optimizer, which enables it to achieve better performance than comparable models while using only about 52% of the training FLOPs. It also maintains strong performance on both English and Chinese benchmarks.
Q: What are the recommended use cases?
The model excels in a wide range of applications including general language understanding, mathematical reasoning, code generation, and multilingual tasks. It's particularly suitable for applications requiring high performance with efficient computational resource usage.
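For deployment-oriented use cases, a minimal offline-inference sketch with vLLM's Python API is shown below. The repository id, context-length setting, and prompt are assumptions; a production setup would typically apply the chat template or use vLLM's OpenAI-compatible server instead.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: repo id and max_model_len (matching the 8K context) are assumptions.
llm = LLM(
    model="moonshotai/Moonlight-16B-A3B-Instruct",
    trust_remote_code=True,
    max_model_len=8192,
)
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Write a Python function that checks whether a number is prime."],
    sampling,
)
print(outputs[0].outputs[0].text)
```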