Llama-3.1-8b-instruct_4bitgs64_hqq_calib

Maintained By: mobiuslabsgmbh


  • License: Llama 3.1
  • Model Size: 8B parameters
  • VRAM Usage: 6.1GB
  • Quantization: 4-bit (group-size=64)

What is Llama-3.1-8b-instruct_4bitgs64_hqq_calib?

This is a highly optimized version of Meta's Llama 3.1 8B Instruct model, quantized with HQQ (Half-Quadratic Quantization) to achieve strong compression while maintaining near-original quality. The model uses calibrated 4-bit quantization with a group size of 64, cutting memory requirements substantially while preserving 99.3% of the original FP16 model's average benchmark performance.
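
As a rough sketch, a pre-quantized HQQ model of this kind is typically loaded through the hqq package; the entry point below matches recent hqq releases but may differ by version, and the repository path is assumed from the maintainer's namespace:

```python
# Minimal loading sketch (hqq package; exact API names may vary by version).
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo path (maintainer namespace + model name).
model_id = "mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(
    model_id,
    compute_dtype=torch.float16,  # dtype used for dequantized compute
    device="cuda",
)
```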

Implementation Details

The model employs calibrated quantization that yields an effective bitrate of roughly 4.5 bits per parameter for the linear layers (4-bit weights plus the group-wise scale/zero-point metadata), requiring only 6.1GB of VRAM compared to the original 15.7GB. On an RTX 3090 it decodes at about 125 tokens/sec for short sequences and 97 tokens/sec for longer ones.
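
The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic. The parameter split below is an assumption, not a figure from this card (roughly 7B parameters in quantized linear layers, with embeddings and the LM head kept in FP16):

```python
# Back-of-the-envelope check of the 6.1GB weight footprint.
# Parameter split is an ASSUMPTION, not a figure from this card:
# ~7B params in quantized linear layers, ~1B (embeddings + LM head) in FP16.
linear_params = 7.0e9
fp16_params = 1.0e9

linear_bytes = linear_params * 4.5 / 8  # 4.5 effective bits per parameter
fp16_bytes = fp16_params * 2            # FP16 = 2 bytes per parameter

print(f"~{(linear_bytes + fp16_bytes) / 1e9:.1f} GB")  # ~5.9 GB; runtime buffers
                                                       # account for the rest.
```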

  • Calibrated quantization for optimal performance
  • Supports multiple backend options (PyTorch, torchao_int4, bitblas); a backend-selection sketch follows this list
  • Compatible with torch 2.4.0 or newer with CUDA 12.1
  • Implements efficient memory management techniques
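
As a sketch of backend selection with the hqq package: the HQQLinear/HQQBackend calls below match hqq's core API, while the prepare_for_inference helper for the torchao_int4/bitblas paths is an assumption to verify against your installed hqq version:

```python
# Sketch: selecting the kernel backend for HQQ's quantized linear layers.
from hqq.core.quantize import HQQLinear, HQQBackend

# Pure-PyTorch path: most portable, no extra dependencies, slowest.
HQQLinear.set_backend(HQQBackend.PYTORCH)

# Faster specialized kernels require extra dependencies (torchao, bitblas).
# ASSUMED helper from recent hqq releases -- verify against the hqq docs:
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="torchao_int4")
```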

Core Capabilities

  • ARC (25-shot): 60.92% accuracy
  • HellaSwag (10-shot): 79.52% accuracy
  • MMLU (5-shot): 67.74% accuracy
  • GSM8K (5-shot): 75.36% accuracy
  • Maintains 99.3% relative performance compared to the FP16 model (a sketch of how such a figure is computed follows this list)
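
The relative-performance figure is presumably the average of these benchmark scores divided by the FP16 baseline's average. A sketch of that computation; the FP16 numbers here are hypothetical placeholders, not results from this card:

```python
# How a relative-performance figure such as 99.3% is typically derived.
def relative_performance(quantized: dict, baseline: dict) -> float:
    """Mean quantized score over mean FP16 baseline score, in percent."""
    q = sum(quantized.values()) / len(quantized)
    b = sum(baseline.values()) / len(baseline)
    return 100.0 * q / b

quantized = {"ARC": 60.92, "HellaSwag": 79.52, "MMLU": 67.74, "GSM8K": 75.36}
# HYPOTHETICAL FP16 baseline scores (placeholders, NOT from this card):
fp16 = {"ARC": 61.3, "HellaSwag": 80.1, "MMLU": 68.2, "GSM8K": 76.0}

print(f"{relative_performance(quantized, fp16):.1f}%")  # ~99.3% with these placeholders
```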

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its exceptional balance between compression and performance. It achieves significantly reduced memory usage while maintaining almost identical performance to the original model, with notably faster decoding speeds compared to alternatives like AWQ and GPTQ.

Q: What are the recommended use cases?

The model is ideal for deployment scenarios where memory efficiency is crucial but performance cannot be compromised. It's particularly well-suited for text generation tasks, including essay writing, question-answering, and creative content generation, especially in resource-constrained environments.
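
As a usage illustration, here is a minimal chat-style generation call, assuming the model and tokenizer were loaded as in the earlier sketch and that the repo ships Llama 3.1's chat template:

```python
# Sketch: chat-style generation, reusing `model` and `tokenizer` from the
# loading snippet above. Assumes the repo ships Llama 3.1's chat template.
messages = [{"role": "user", "content": "Write a short essay on model quantization."}]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model replies
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```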
