Sparse-Llama-3.1-8B-2of4
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Text Generation |
| Architecture | Llama-3.1-8B with 2:4 Sparsity |
| License | llama3.1 |
| Developer | Neural Magic |
| Research Papers | SparseGPT, SquareHead |
What is Sparse-Llama-3.1-8B-2of4?
Sparse-Llama-3.1-8B-2of4 is an optimized version of Llama-3.1-8B that applies a 2:4 sparsity pattern, in which two of every four consecutive weights are pruned to zero. Despite removing half of the weights in its transformer blocks, the model recovers 98.37% of the original accuracy on the OpenLLM benchmark and 97.3% on the Mosaic Eval Gauntlet.
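To make the pattern concrete, the sketch below applies a simple magnitude-based 2:4 pruning rule to a weight matrix: in every group of four consecutive weights, the two smallest in absolute value are zeroed. This is an illustrative assumption only; the released model was pruned with SparseGPT rather than this rule, and the function name and tensor shapes are hypothetical.

```python
import torch

def prune_2of4(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative magnitude-based 2:4 pruning: zero the two smallest-
    magnitude weights in every group of four consecutive weights."""
    rows, cols = weight.shape                      # cols must be divisible by 4
    groups = weight.reshape(-1, 4)
    drop_idx = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)                # keep two of every four weights
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2of4(w)
print((w_sparse == 0).float().mean())  # -> 0.5, i.e. exactly 50% zeros
```

Because the zeros always fall inside aligned groups of four, the weights can be stored in a compressed layout and executed on hardware with structured-sparsity support, which is what enables inference speedups.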
Implementation Details
The model combines SparseGPT pruning with SquareHead knowledge distillation. All linear operators within the transformer blocks were pruned to the 2:4 pattern, and the model was then trained with knowledge distillation on 13B tokens to recover accuracy. The implementation is designed for efficient deployment on the vLLM backend; a minimal inference sketch follows the list below.
- Utilizes 2:4 sparsity pattern in transformer blocks
- Trained with knowledge distillation for accuracy recovery
- Optimized for vLLM deployment
- BF16 tensor type for efficient computation
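Since deployment targets the vLLM backend, here is a minimal offline-inference sketch. The Hugging Face model identifier, prompt, and sampling settings are assumptions for illustration.

```python
from vllm import LLM, SamplingParams

# Assumed model repo id; substitute the actual Hugging Face identifier.
MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-2of4"

# Load the sparse checkpoint in BF16, matching the published tensor type.
llm = LLM(model=MODEL_ID, dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain 2:4 structured sparsity in one paragraph."], params)
print(outputs[0].outputs[0].text)
```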
Core Capabilities
- Maintains a 62.16 average score on the OpenLLM benchmark
- Strong performance on GSM8K (56.3) and ARC-Challenge (59.4)
- Effective language understanding, with a 69.0 score on related tasks
- Efficient deployment through vLLM with OpenAI-compatible serving (see the example after this list)
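For server-style deployment, vLLM exposes an OpenAI-compatible HTTP API. The sketch below assumes the server was started locally with `vllm serve neuralmagic/Sparse-Llama-3.1-8B-2of4` on the default port; the model id, port, and prompt are illustrative assumptions.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key,
# so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="neuralmagic/Sparse-Llama-3.1-8B-2of4",  # assumed repo id
    prompt="Summarize the benefits of 2:4 weight sparsity.",
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].text)
```

Because the model derives from the base Llama-3.1-8B rather than an instruct variant, the plain completions endpoint is used here rather than the chat endpoint.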
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is that it maintains near-original accuracy while adopting a 2:4 sparsity pattern, which halves the number of non-zero weights and correspondingly reduces compute and memory requirements.
Q: What are the recommended use cases?
This model suits deployment scenarios where inference efficiency is crucial but accuracy cannot be sacrificed. It is well suited to text-generation tasks, especially in environments that can take advantage of the vLLM backend's optimizations.