# MiniMind2
| Property | Value |
|---|---|
| Parameter Count | 26M-145M |
| Model Type | Language Model (Chinese) |
| Architecture | Transformer Decoder-Only |
| License | Apache-2.0 |
| Training Time | ~2 hours on a single NVIDIA 3090 |
| Model URL | https://huggingface.co/jingyaogong/MiniMind2 |
## What is MiniMind2?
MiniMind2 is an ultra-lightweight Chinese language model series designed to make LLM training accessible to individual researchers and developers. With models ranging from just 26M to 145M parameters, it delivers usable Chinese conversation for its size, and the smallest variant can be trained from scratch on a single consumer GPU in roughly 2 hours.
## Implementation Details
The model implements a Transformer decoder-only architecture with several optimizations, including RMSNorm pre-normalization, SwiGLU activation, and rotary positional embeddings (RoPE); a minimal sketch of these components follows the list below. Both dense and mixture-of-experts (MoE) variants are provided, with the latter routing each token to a subset of expert feed-forward networks to increase capacity at similar per-token compute.
- Custom tokenizer with a 6,400-token vocabulary
- Implemented in PyTorch with minimal dependencies
- Supports single/multi-GPU training via DDP and DeepSpeed
- Complete training pipeline covering pretraining, SFT, LoRA, and DPO
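
The following is a minimal, illustrative sketch of the components named above (RMSNorm pre-normalization, SwiGLU feed-forward, and RoPE) in plain PyTorch. It is not the repository's actual code; module names, shapes, and defaults are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0):
    """Precompute the complex rotations used by RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)                   # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # unit complex numbers

def apply_rope(x, freqs_cis):
    """Rotate query/key tensors of shape (batch, seq, heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis[:, None, :]      # broadcast over heads
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)
```

In the actual model, `apply_rope` would be applied to the query and key tensors inside each attention layer before the attention scores are computed, and RMSNorm would be applied before the attention and feed-forward sublayers.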
## Core Capabilities
- Basic conversation and knowledge-based Q&A (a minimal inference sketch follows this list)
- Chinese language understanding and generation
- Limited English language capabilities
- Support for custom domain adaptation via LoRA
- Optional reasoning capabilities via distillation on DeepSeek-R1-style reasoning data
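
Below is a hedged usage sketch for the conversational capabilities listed above. It assumes the Hugging Face checkpoint loads through the standard `transformers` auto classes with `trust_remote_code=True` and that the tokenizer ships a chat template; the repository's own inference scripts may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the checkpoint is loadable via the transformers auto classes
# with custom code enabled; check the repo for its official inference script.
repo = "jingyaogong/MiniMind2"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

# A short Chinese prompt ("Please introduce yourself."), formatted with the
# tokenizer's chat template if one is provided.
messages = [{"role": "user", "content": "请介绍一下你自己。"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=128, do_sample=True, temperature=0.7
    )
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

If the checkpoint does not expose a chat template, the raw prompt string can be passed to the tokenizer directly instead.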
## Frequently Asked Questions
Q: What makes this model unique?
MiniMind2's uniqueness lies in its extreme efficiency and accessibility. It demonstrates that meaningful language model capabilities can be achieved with minimal computational resources: the smallest variant can be trained from scratch in about 2 hours for less than $0.50 in GPU rental cost.
Q: What are the recommended use cases?
The model is ideal for research, educational purposes, and proof-of-concept deployments where resource constraints are significant. It's particularly suitable for learning LLM training fundamentals and experimenting with custom domain adaptation through LoRA fine-tuning.
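
As a sketch of the LoRA-based domain adaptation mentioned above, the snippet below applies the `peft` library to the checkpoint. The target module names and hyperparameters are illustrative assumptions; the MiniMind repository ships its own LoRA training script, which may organize this differently.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumptions: the model loads via transformers with trust_remote_code, and its
# attention projections are named "q_proj"/"v_proj" (verify the actual module names).
model = AutoModelForCausalLM.from_pretrained(
    "jingyaogong/MiniMind2", trust_remote_code=True
)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension of the adapters
    lora_alpha=16,                         # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # hypothetical projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small adapter weights train
# The adapted model can then be fine-tuned on a domain-specific SFT dataset
# with an ordinary causal-language-modeling loop or the Trainer API.
```

Because only the low-rank adapter matrices are updated, memory and storage requirements stay small enough for a single consumer GPU.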