Llama-3.1-8b-instruct_4bitgs64_hqq_calib
| Property | Value |
|---|---|
| License | Llama 3.1 |
| Model Size | 8B parameters |
| VRAM Usage | 6.1GB |
| Quantization | 4-bit (group size = 64) |
What is Llama-3.1-8b-instruct_4bitgs64_hqq_calib?
This is a heavily optimized version of Meta's Llama 3.1 8B Instruct model, quantized with HQQ (Half-Quadratic Quantization) and calibration to achieve strong compression while maintaining near-original performance. The model uses 4-bit quantization with a group size of 64, cutting VRAM requirements by more than half while preserving 99.3% of the original model's benchmark performance.
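For reference, the 4-bit, group-size-64 setting can be expressed through the HqqConfig integration in the transformers library. The sketch below quantizes the base model on the fly with the same settings; it does not reproduce the calibration applied to this checkpoint, and the base repo id shown is assumed to be Meta's standard Instruct release.

```python
# Minimal sketch: on-the-fly HQQ quantization of the base model with the same
# 4-bit / group-size-64 settings described above. This illustrates the config
# only; it does not include the calibration step used for this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed base repo id

# 4-bit weights, 64 weights per quantization group (matches this card's settings)
quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
```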
Implementation Details
The quantization achieves an effective bitrate of roughly 4.5 bits per weight for the linear layers (4-bit weights plus per-group quantization metadata), so the model requires only 6.1GB of VRAM compared to the original 15.7GB. On an RTX 3090 it reaches decoding speeds of about 125 tokens/sec for short sequences and 97 tokens/sec for longer sequences.
- Calibrated quantization for optimal performance
- Supports multiple backend options (PyTorch, torchao_int4, bitblas); see the loading sketch after this list
- Requires torch 2.4.0 or newer with CUDA 12.1
- Implements efficient memory management techniques
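A minimal loading sketch, assuming the hqq library's HQQModelForCausalLM.from_quantized and prepare_for_inference helpers (module paths and argument names can differ between hqq releases) and assuming the checkpoint is hosted under the mobiuslabsgmbh organization on the Hugging Face Hub:

```python
# Sketch: load the pre-quantized checkpoint and select an inference backend.
# Assumes hqq's HQQModelForCausalLM / prepare_for_inference API; names and
# arguments may vary across hqq versions. The repo id below is an assumption.
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.utils.patching import prepare_for_inference

model_id = "mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(
    model_id,
    compute_dtype=torch.float16,
    device="cuda",
)

# Choose one of the backends listed above: "torchao_int4" is typically the
# fastest on recent NVIDIA GPUs, while the default PyTorch backend is the
# most portable fallback.
prepare_for_inference(model, backend="torchao_int4")
```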
Core Capabilities
- ARC (25-shot): 60.92% accuracy
- HellaSwag (10-shot): 79.52% accuracy
- MMLU (5-shot): 67.74% accuracy
- GSM8K (5-shot): 75.36% accuracy
- Maintains 99.3% of the FP16 model's performance on these benchmarks
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its exceptional balance between compression and performance. It achieves significantly reduced memory usage while maintaining almost identical performance to the original model, with notably faster decoding speeds compared to alternatives like AWQ and GPTQ.
Q: What are the recommended use cases?
The model is ideal for deployment scenarios where memory efficiency is crucial but performance cannot be compromised. It's particularly well-suited for text generation tasks, including essay writing, question-answering, and creative content generation, especially in resource-constrained environments.
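As an illustration of such a workload, a generation call reusing the model and tokenizer from the loading sketch in the Implementation Details section could look like the following; the prompt and sampling parameters are arbitrary examples.

```python
# Illustrative generation call, reusing `model` and `tokenizer` from the
# loading sketch above. Prompt and sampling parameters are arbitrary examples.
messages = [
    {"role": "user", "content": "Write a short essay on memory-efficient LLM inference."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```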