# sparsetral-16x7B-v2
| Property | Value |
|---|---|
| Parameter Count | 9.39B |
| License | Apache 2.0 |
| Base Model | Mistral-7B-Instruct-v0.2 |
| Training Data | OpenHermes-2.5 |
| Paper | View Research Paper |
## What is sparsetral-16x7B-v2?
sparsetral-16x7B-v2 is a language model that adds a Mixture-of-Experts (MoE) architecture with 16 experts on top of the Mistral-7B-Instruct-v0.2 base model. It is built around parameter-efficient sparsity crafting: QLoRA fine-tuning combined with lightweight MoE adapters, which adds expert capacity and improves performance while keeping training and inference computationally efficient.
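A minimal loading sketch, assuming the weights are published as the Hugging Face repository `serpdotai/sparsetral-16x7B-v2` and that the custom MoE adapter modules require `trust_remote_code` (both assumptions, not confirmed by this page):

```python
# Loading sketch -- repository id and trust_remote_code requirement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "serpdotai/sparsetral-16x7B-v2"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # adapters and routers were trained in bf16
    device_map="auto",
    trust_remote_code=True,      # assumed: custom sparse MoE adapter code ships with the repo
)
```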
## Implementation Details
The model was trained on 8 A6000 GPUs using a fork of unsloth for efficient training. Key hyperparameters include a sequence length of 4096, an effective batch size of 128, and a learning rate of 2e-5 with linear decay. The MoE layer uses 16 experts with top-k routing (k = 4) and an adapter dimension of 512.
- QLoRA training with rank 64 and alpha 16
- MoE adapters and routers trained in bf16 format
- ChatML-style prompt format using `<|im_start|>` and `<|im_end|>` tokens (see the prompt sketch after this list)
- Optimized for a 4096-token context window
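The `<|im_start|>`/`<|im_end|>` format above is the ChatML convention. A minimal generation sketch under that assumption, reusing the `model` and `tokenizer` from the loading example and writing the prompt out by hand:

```python
# ChatML-style prompt sketch -- assumes the model and tokenizer loaded above.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Summarize mixture-of-experts routing in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```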
## Core Capabilities
- Advanced text generation with improved parameter efficiency
- Optimized for instruction-following tasks
- Enhanced conversational abilities
- Efficient top-k routing between expert adapter networks (see the sketch below)
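To make the expert routing concrete, here is a generic top-k adapter-MoE layer in PyTorch using the dimensions quoted above (16 experts, k = 4, adapter dimension 512). It is an illustrative sketch of the mechanism, not the model's actual implementation; the class and variable names are hypothetical.

```python
# Illustrative top-k mixture of bottleneck adapters -- a generic sketch, not sparsetral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKAdapterMoE(nn.Module):
    def __init__(self, hidden_size=4096, adapter_dim=512, num_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small bottleneck adapter: down-project, nonlinearity, up-project.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, adapter_dim),
                nn.SiLU(),
                nn.Linear(adapter_dim, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, hidden)
        logits = self.router(hidden_states)                      # (batch, seq, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # renormalize over the k chosen experts
        # Dense computation of every expert for clarity; real MoE code dispatches tokens sparsely.
        expert_out = torch.stack([e(hidden_states) for e in self.experts], dim=-2)  # (b, s, E, h)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, indices, weights)                     # zero gate for unselected experts
        mixed = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)   # (batch, seq, hidden)
        return hidden_states + mixed                             # residual adapter connection
```

Calling `TopKAdapterMoE()(torch.randn(1, 8, 4096))` returns a tensor of the same shape, with each token's residual update mixed from its four highest-scoring experts; a production implementation would route tokens to experts sparsely rather than evaluating every expert densely.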
## Frequently Asked Questions
### Q: What makes this model unique?
The model's distinguishing feature is its parameter-efficient sparsity crafting approach, which combines an MoE adapter architecture with QLoRA fine-tuning. This yields improved performance while keeping computational requirements reasonable.
### Q: What are the recommended use cases?
The model is particularly well-suited for conversational AI applications, instruction-following tasks, and general text generation scenarios where efficient parameter usage is crucial.