Sparse-Llama-3.1-8B-2of4
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| Model Type | Text Generation |
| Architecture | Llama-3.1-8B with 2:4 Sparsity |
| License | llama3.1 |
| Developer | Neural Magic |
| Research Papers | SparseGPT, SquareHead |
What is Sparse-Llama-3.1-8B-2of4?
Sparse-Llama-3.1-8B-2of4 is an optimized version of Llama-3.1-8B that applies a 2:4 sparsity pattern, in which two of every four consecutive weights are pruned to zero. Despite removing half of the weights in its transformer blocks, the model recovers 98.37% of the original accuracy on the OpenLLM benchmark and 97.3% on the Mosaic Eval Gauntlet.
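To make the pattern concrete, the sketch below applies a simple magnitude-based 2:4 pruning rule to a weight matrix: in every group of four consecutive weights, the two smallest in absolute value are zeroed. This is an illustrative assumption only; the released model was pruned with SparseGPT rather than this rule, and the function name and tensor shapes are hypothetical.

```python
import torch

def prune_2of4(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative magnitude-based 2:4 pruning: zero the two smallest-
    magnitude weights in every group of four consecutive weights."""
    rows, cols = weight.shape                      # cols must be divisible by 4
    groups = weight.reshape(-1, 4)
    drop_idx = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)                # keep two of every four weights
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2of4(w)
print((w_sparse == 0).float().mean())  # -> 0.5, i.e. exactly 50% zeros
```

Because the zeros always fall inside aligned groups of four, the weights can be stored in a compressed layout and executed on hardware with structured-sparsity support, which is what enables inference speedups.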
Implementation Details
The model combines SparseGPT pruning with SquareHead knowledge distillation. All linear operators within the transformer blocks were pruned to the 2:4 pattern, and the model was then trained with knowledge distillation on 13B tokens to recover accuracy. The implementation is designed for efficient deployment on the vLLM backend; a minimal inference sketch follows the list below.
- Utilizes 2:4 sparsity pattern in transformer blocks
- Trained with knowledge distillation for accuracy recovery
- Optimized for vLLM deployment
- BF16 tensor type for efficient computation
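Since deployment targets the vLLM backend, here is a minimal offline-inference sketch. The Hugging Face model identifier, prompt, and sampling settings are assumptions for illustration.

```python
from vllm import LLM, SamplingParams

# Assumed model repo id; substitute the actual Hugging Face identifier.
MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-2of4"

# Load the sparse checkpoint in BF16, matching the published tensor type.
llm = LLM(model=MODEL_ID, dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain 2:4 structured sparsity in one paragraph."], params)
print(outputs[0].outputs[0].text)
```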
Core Capabilities
- Maintains a 62.16 average score on the OpenLLM benchmark
- Strong performance on GSM8K (56.3) and ARC-Challenge (59.4)
- Effective language understanding, with a 69.0 score on related tasks
- Efficient deployment through vLLM with OpenAI-compatible serving (see the example after this list)
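For server-style deployment, vLLM exposes an OpenAI-compatible HTTP API. The sketch below assumes the server was started locally with `vllm serve neuralmagic/Sparse-Llama-3.1-8B-2of4` on the default port; the model id, port, and prompt are illustrative assumptions.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key,
# so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="neuralmagic/Sparse-Llama-3.1-8B-2of4",  # assumed repo id
    prompt="Summarize the benefits of 2:4 weight sparsity.",
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].text)
```

Because the model derives from the base Llama-3.1-8B rather than an instruct variant, the plain completions endpoint is used here rather than the chat endpoint.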
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is that it maintains near-original accuracy while adopting a 2:4 sparsity pattern, which halves the number of non-zero weights and correspondingly reduces compute and memory requirements.
Q: What are the recommended use cases?
This model suits deployment scenarios where inference efficiency is crucial but accuracy cannot be sacrificed. It is well suited to text-generation tasks, especially in environments that can take advantage of the vLLM backend's optimizations.