sparsetral-16x7B-v2

Maintained By: serpdotai


Parameter Count: 9.39B
License: Apache 2.0
Base Model: Mistral-7B-Instruct-v0.2
Training Data: OpenHermes-2.5
Paper: View Research Paper

What is sparsetral-16x7B-v2?

sparsetral-16x7B-v2 is a language model that adds a Mixture-of-Experts (MoE) architecture with 16 experts on top of Mistral-7B-Instruct-v0.2. It focuses on parameter-efficient fine-tuning, using QLoRA together with MoE adapters and routers to improve performance while keeping memory and compute requirements modest.
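Assuming the checkpoint is published on Hugging Face under serpdotai/sparsetral-16x7B-v2, a minimal loading sketch with the transformers library might look as follows. Because the MoE adapters and routers are custom modules rather than part of the stock Mistral architecture, trust_remote_code=True is likely required; treat the repo id and dtype choices here as assumptions, not confirmed instructions.

```python
# Minimal loading sketch (assumes the Hugging Face repo id "serpdotai/sparsetral-16x7B-v2"
# and that the custom MoE adapter code ships with the checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "serpdotai/sparsetral-16x7B-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # adapters/routers were trained in bf16
    device_map="auto",            # spread layers across available GPUs
    trust_remote_code=True,       # MoE adapter/router modules are custom code
)
```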

Implementation Details

The model was trained on 8 A6000 GPUs using a forked version of unsloth for memory-efficient training. Key hyperparameters include a sequence length of 4096, an effective batch size of 128, and a learning rate of 2e-5 with linear decay. The MoE configuration uses 16 experts with top-k routing (k=4) and an adapter dimension of 512; these settings are collected in the configuration sketch after the list below.

  • QLoRA training with rank 64 and alpha 16
  • MoE adapters and routers trained in bf16 format
  • Custom prompt format using im_start and im_end tokens
  • Optimized for 4096 token context window
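For reference, the hyperparameters above can be gathered into a single configuration object. This is an illustrative sketch only: the dataclass and its field names are hypothetical, not the keys used by the actual training scripts.

```python
# Illustrative training configuration assembled from the numbers quoted above.
# Field names are hypothetical; the real training code may use different keys.
from dataclasses import dataclass

@dataclass
class SparsetralTrainingConfig:
    base_model: str = "mistralai/Mistral-7B-Instruct-v0.2"
    sequence_length: int = 4096
    effective_batch_size: int = 128
    learning_rate: float = 2e-5
    lr_schedule: str = "linear"
    num_experts: int = 16
    top_k: int = 4
    adapter_dim: int = 512
    lora_rank: int = 64          # QLoRA rank
    lora_alpha: int = 16         # QLoRA alpha
    moe_dtype: str = "bf16"      # MoE adapters and routers trained in bf16

config = SparsetralTrainingConfig()
```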

Core Capabilities

  • Advanced text generation with improved parameter efficiency
  • Optimized for instruction-following tasks
  • Enhanced conversational abilities
  • Efficient routing between expert networks

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its parameter-efficient sparsity crafting approach, combining MoE architecture with QLoRA fine-tuning. This allows for enhanced performance while maintaining reasonable computational requirements.
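To make the routing idea concrete, the sketch below shows one way a top-k routed MoE adapter could look in PyTorch, using the stated hyperparameters (16 experts, k=4, adapter dimension 512). It is a simplified illustration of the technique, not the model's actual implementation.

```python
# Simplified PyTorch sketch of a top-k routed MoE adapter.
# Illustrative only; the real sparsetral modules may differ in detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    def __init__(self, hidden_size: int = 4096, adapter_dim: int = 512,
                 num_experts: int = 16, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small bottleneck adapter (down-project, activate, up-project).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, adapter_dim),
                nn.SiLU(),
                nn.Linear(adapter_dim, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.router(hidden_states)                      # (B, T, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # normalize over the chosen k experts

        output = torch.zeros_like(hidden_states)
        for slot in range(self.top_k):
            idx = indices[..., slot]                             # (B, T) expert id per token
            w = weights[..., slot].unsqueeze(-1)                 # (B, T, 1) routing weight
            for expert_id, expert in enumerate(self.experts):
                mask = (idx == expert_id).unsqueeze(-1)          # tokens routed to this expert
                if mask.any():
                    output = output + mask * w * expert(hidden_states)
        # Residual connection: the adapter adds a sparse correction to the frozen base output.
        return hidden_states + output
```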

Q: What are the recommended use cases?

The model is particularly well-suited for conversational AI applications, instruction-following tasks, and general text generation scenarios where efficient parameter usage is crucial.
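Continuing from the loading sketch above, here is a hedged generation example for conversational use. The ChatML-style template is an assumption based on the im_start/im_end note in the implementation details, so the exact delimiters should be verified against the model card.

```python
# Hedged generation example reusing the `model` and `tokenizer` from the loading sketch.
# The ChatML-style prompt below is assumed from the im_start/im_end note above.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Summarize what a Mixture-of-Experts model is in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```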
