Llama-3.1-8b-instruct_4bitgs64_hqq_calib
| Property | Value |
|---|---|
| License | Llama 3.1 |
| Model Size | 8B parameters |
| VRAM Usage | 6.1GB |
| Quantization | 4-bit (group size = 64) |
What is Llama-3.1-8b-instruct_4bitgs64_hqq_calib?
This is a heavily optimized version of Meta's Llama 3.1 8B Instruct model, quantized with HQQ (Half-Quadratic Quantization) and calibration to achieve strong compression while maintaining near-original performance. The model uses 4-bit quantization with a group size of 64, cutting VRAM requirements by more than half while preserving 99.3% of the original model's benchmark performance.
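For reference, the 4-bit, group-size-64 setting can be expressed through the HqqConfig integration in the transformers library. The sketch below quantizes the base model on the fly with the same settings; it does not reproduce the calibration applied to this checkpoint, and the base repo id shown is assumed to be Meta's standard Instruct release.

```python
# Minimal sketch: on-the-fly HQQ quantization of the base model with the same
# 4-bit / group-size-64 settings described above. This illustrates the config
# only; it does not include the calibration step used for this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed base repo id

# 4-bit weights, 64 weights per quantization group (matches this card's settings)
quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
```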
Implementation Details
The quantization achieves an effective bitrate of roughly 4.5 bits per weight for the linear layers (4-bit weights plus per-group quantization metadata), so the model requires only 6.1GB of VRAM compared to the original 15.7GB. On an RTX 3090 it reaches decoding speeds of about 125 tokens/sec for short sequences and 97 tokens/sec for longer sequences.
- Calibrated quantization for optimal performance
- Supports multiple backend options (PyTorch, torchao_int4, bitblas); see the loading sketch after this list
- Requires torch 2.4.0 or newer with CUDA 12.1
- Implements efficient memory management techniques
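A minimal loading sketch, assuming the hqq library's HQQModelForCausalLM.from_quantized and prepare_for_inference helpers (module paths and argument names can differ between hqq releases) and assuming the checkpoint is hosted under the mobiuslabsgmbh organization on the Hugging Face Hub:

```python
# Sketch: load the pre-quantized checkpoint and select an inference backend.
# Assumes hqq's HQQModelForCausalLM / prepare_for_inference API; names and
# arguments may vary across hqq versions. The repo id below is an assumption.
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.utils.patching import prepare_for_inference

model_id = "mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(
    model_id,
    compute_dtype=torch.float16,
    device="cuda",
)

# Choose one of the backends listed above: "torchao_int4" is typically the
# fastest on recent NVIDIA GPUs, while the default PyTorch backend is the
# most portable fallback.
prepare_for_inference(model, backend="torchao_int4")
```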
Core Capabilities
- ARC (25-shot): 60.92% accuracy
- HellaSwag (10-shot): 79.52% accuracy
- MMLU (5-shot): 67.74% accuracy
- GSM8K (5-shot): 75.36% accuracy
- Maintains 99.3% of the FP16 model's performance on these benchmarks
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its exceptional balance between compression and performance. It achieves significantly reduced memory usage while maintaining almost identical performance to the original model, with notably faster decoding speeds compared to alternatives like AWQ and GPTQ.
Q: What are the recommended use cases?
The model is ideal for deployment scenarios where memory efficiency is crucial but performance cannot be compromised. It's particularly well-suited for text generation tasks, including essay writing, question-answering, and creative content generation, especially in resource-constrained environments.
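As an illustration of such a workload, a generation call reusing the model and tokenizer from the loading sketch in the Implementation Details section could look like the following; the prompt and sampling parameters are arbitrary examples.

```python
# Illustrative generation call, reusing `model` and `tokenizer` from the
# loading sketch above. Prompt and sampling parameters are arbitrary examples.
messages = [
    {"role": "user", "content": "Write a short essay on memory-efficient LLM inference."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```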