# sparsetral-16x7B-v2
| Property | Value |
|---|---|
| Parameter Count | 9.39B |
| License | Apache 2.0 |
| Base Model | Mistral-7B-Instruct-v0.2 |
| Training Data | OpenHermes-2.5 |
| Paper | View Research Paper |
## What is sparsetral-16x7B-v2?
sparsetral-16x7B-v2 is a language model that adds a Mixture-of-Experts (MoE) architecture with 16 experts on top of the Mistral-7B-Instruct-v0.2 base model. It is built around parameter-efficient sparsity crafting: QLoRA fine-tuning combined with lightweight MoE adapters, which adds expert capacity and improves performance while keeping training and inference computationally efficient.
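A minimal loading sketch, assuming the weights are published as the Hugging Face repository `serpdotai/sparsetral-16x7B-v2` and that the custom MoE adapter modules require `trust_remote_code` (both assumptions, not confirmed by this page):

```python
# Loading sketch -- repository id and trust_remote_code requirement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "serpdotai/sparsetral-16x7B-v2"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # adapters and routers were trained in bf16
    device_map="auto",
    trust_remote_code=True,      # assumed: custom sparse MoE adapter code ships with the repo
)
```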
## Implementation Details
The model was trained on 8 A6000 GPUs using a fork of unsloth for efficient training. Key hyperparameters include a sequence length of 4096, an effective batch size of 128, and a learning rate of 2e-5 with linear decay. The MoE layer uses 16 experts with top-k routing (k = 4) and an adapter dimension of 512.
- QLoRA training with rank 64 and alpha 16
- MoE adapters and routers trained in bf16 format
- ChatML-style prompt format using `<|im_start|>` and `<|im_end|>` tokens (see the prompt sketch after this list)
- Optimized for a 4096-token context window
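The `<|im_start|>`/`<|im_end|>` format above is the ChatML convention. A minimal generation sketch under that assumption, reusing the `model` and `tokenizer` from the loading example and writing the prompt out by hand:

```python
# ChatML-style prompt sketch -- assumes the model and tokenizer loaded above.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Summarize mixture-of-experts routing in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```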
## Core Capabilities
- Advanced text generation with improved parameter efficiency
- Optimized for instruction-following tasks
- Enhanced conversational abilities
- Efficient top-k routing between expert adapter networks (see the sketch below)
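To make the expert routing concrete, here is a generic top-k adapter-MoE layer in PyTorch using the dimensions quoted above (16 experts, k = 4, adapter dimension 512). It is an illustrative sketch of the mechanism, not the model's actual implementation; the class and variable names are hypothetical.

```python
# Illustrative top-k mixture of bottleneck adapters -- a generic sketch, not sparsetral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKAdapterMoE(nn.Module):
    def __init__(self, hidden_size=4096, adapter_dim=512, num_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small bottleneck adapter: down-project, nonlinearity, up-project.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, adapter_dim),
                nn.SiLU(),
                nn.Linear(adapter_dim, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, hidden)
        logits = self.router(hidden_states)                      # (batch, seq, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # renormalize over the k chosen experts
        # Dense computation of every expert for clarity; real MoE code dispatches tokens sparsely.
        expert_out = torch.stack([e(hidden_states) for e in self.experts], dim=-2)  # (b, s, E, h)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, indices, weights)                     # zero gate for unselected experts
        mixed = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)   # (batch, seq, hidden)
        return hidden_states + mixed                             # residual adapter connection
```

Calling `TopKAdapterMoE()(torch.randn(1, 8, 4096))` returns a tensor of the same shape, with each token's residual update mixed from its four highest-scoring experts; a production implementation would route tokens to experts sparsely rather than evaluating every expert densely.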
## Frequently Asked Questions
### Q: What makes this model unique?
The model's distinguishing feature is its parameter-efficient sparsity crafting approach, which combines an MoE adapter architecture with QLoRA fine-tuning. This yields improved performance while keeping computational requirements reasonable.
### Q: What are the recommended use cases?
The model is particularly well-suited for conversational AI applications, instruction-following tasks, and general text generation scenarios where efficient parameter usage is crucial.