Doge-160M-Instruct

Doge-160M-Instruct is a 160M-parameter language model from the SmallDoge project. It uses Dynamic Mask Attention and a Cross Domain Mixture of Experts, and is instruction-tuned on the SmolTalk and UltraFeedback datasets.

| Property | Value |
|---|---|
| Parameter Count | 160M |
| Model Type | Instruction-tuned Language Model |
| Architecture | Dynamic Mask Attention with Cross Domain MoE |
| Paper | Wonderful Matrices (2024) |
| Training Data | SmolTalk (SFT), UltraFeedback Binarized (DPO) |

What is Doge-160M-Instruct?

Doge-160M-Instruct combines Dynamic Mask Attention with a Cross Domain Mixture of Experts to get strong performance from a compact model. At 160M parameters it posts competitive results on standard benchmarks while keeping compute and memory requirements low enough for CPU inference.

Implementation Details

The model was trained in two stages: Supervised Fine-Tuning (SFT) on the SmolTalk dataset, followed by Direct Preference Optimization (DPO) on UltraFeedback Binarized. Training used bfloat16 precision, with a learning rate of 4e-4 for the SFT phase and 4e-5 for the DPO phase.

  • Dynamic Mask Attention lets the model use self-attention during training and switch to a state-space formulation during inference
  • Cross Domain Mixture of Experts inherits weights from Multi-Layer Perceptron
  • 2048 token context length for SFT, 1024 for DPO
  • Batch size of 0.25M for SFT and 0.125M for DPO
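Assuming the batch sizes above are denominated in tokens (the card does not state the unit) and sequences are packed to the full context length, the per-batch sequence counts work out roughly as follows:

```python
# Back-of-envelope sequence counts per batch. Assumptions: batch sizes are
# in tokens, and sequences are packed to the full context length.
sft_batch_tokens = 250_000   # 0.25M per SFT batch
dpo_batch_tokens = 125_000   # 0.125M per DPO batch
sft_context = 2048           # SFT context length
dpo_context = 1024           # DPO context length

sft_seqs = sft_batch_tokens // sft_context
dpo_seqs = dpo_batch_tokens // dpo_context
print(sft_seqs, dpo_seqs)  # → 122 122
```

Under these assumptions, both phases see a similar number of sequences per step; DPO halves both the context and the token budget.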

Core Capabilities

  • Achieves 16.8% on IFEval Prompt Strict Accuracy
  • 29.7% performance on MMLU benchmark
  • 42.8% accuracy on ARC tasks
  • 64.1% on PIQA evaluations
  • Processing speed of 28 tokens/s on an 11th-gen Intel i7 CPU
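The reported throughput gives a rough sense of interactive latency. Taking a 256-token reply as a hypothetical length (not a figure from the card):

```python
# Rough latency estimate from the reported CPU throughput.
tokens_per_second = 28   # reported on an 11th-gen Intel i7 CPU
reply_tokens = 256       # hypothetical reply length, for illustration

latency_s = reply_tokens / tokens_per_second
print(f"{latency_s:.1f} s")  # → 9.1 s
```

So a medium-length reply completes in under ten seconds on commodity hardware, which is the kind of trade-off this model targets.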

Frequently Asked Questions

Q: What makes this model unique?

Its Dynamic Mask Attention mechanism switches between self-attention for training and a state-space formulation for inference, and its Cross Domain Mixture of Experts inherits weights from the MLP layers. Together these let the model perform strongly despite its relatively small size.
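As a toy illustration of mode-dependent masking (not the model's actual mechanism, which is described in the Wonderful Matrices paper), the sketch below contrasts a full causal attention mask with a restricted mask that keeps only recent positions, a rough stand-in for a cheaper inference path:

```python
import numpy as np

def attention(q, k, v, mask):
    # Scaled dot-product attention with a boolean mask (illustrative only).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # masked positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 6, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# "Training" mode: full causal self-attention over all past positions.
causal = np.tril(np.ones((T, T), dtype=bool))

# "Inference" mode: keep only a small window of recent positions
# (a crude stand-in for the state-space style path; window=2 is arbitrary).
window = 2
dynamic = causal & (np.arange(T)[None, :] > np.arange(T)[:, None] - window)

out_train = attention(q, k, v, causal)
out_infer = attention(q, k, v, dynamic)
print(out_train.shape, out_infer.shape)  # → (6, 8) (6, 8)
```

The two modes share weights and produce outputs of the same shape; only the set of positions each token attends to changes.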

Q: What are the recommended use cases?

The model is particularly well-suited for instruction-following tasks, general language understanding, and applications requiring a balance between performance and computational efficiency. It's especially valuable in scenarios where resource constraints are important considerations.
