Meta-Llama-3.1-8B-Instruct-quantized.w8a8

Maintained By
neuralmagic

Property        Value
Model Size      8B parameters
License         Llama 3.1
Release Date    July 11, 2024
Developer       Neural Magic
Model URL       neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8

What is Meta-Llama-3.1-8B-Instruct-quantized.w8a8?

This is an optimized version of Meta's Llama 3.1 8B instruction-tuned model, featuring INT8 quantization of both weights and activations. Quantization cuts GPU memory usage by roughly 50% and approximately doubles matrix-multiplication throughput, while the model matches or even exceeds the original's performance across various benchmarks.

Implementation Details

The model employs sophisticated quantization techniques, using the GPTQ algorithm with a 1% damping factor and processing 256 sequences of 8,192 random tokens. It implements symmetric static per-channel quantization for weights and symmetric dynamic per-token quantization for activations, specifically targeting linear operators within transformer blocks.
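The two schemes described above can be illustrated with a small numpy sketch. This is not Neural Magic's implementation (which uses the GPTQ algorithm with calibration data); it only shows the arithmetic of symmetric per-channel INT8 weight quantization and symmetric per-token INT8 activation quantization, and how an INT8 matmul is rescaled back to float:

```python
import numpy as np

def quantize_weights_per_channel(w: np.ndarray):
    """Symmetric static per-channel INT8: one scale per output channel (row)."""
    # Map the largest absolute value in each row to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Symmetric dynamic per-token INT8: one scale per token (row),
    computed on the fly at inference time rather than ahead of time."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy linear layer: quantize, multiply in INT8, dequantize the result.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # (out_features, in_features)
x = rng.normal(size=(2, 8)).astype(np.float32)   # (tokens, in_features)

qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

# Accumulate the INT8 matmul in int32, then rescale to float.
y_quant = (qx.astype(np.int32) @ qw.T.astype(np.int32)) * (sx * sw.T)
y_ref = x @ w.T
print(np.max(np.abs(y_quant - y_ref)))  # small quantization error
```

Because activation scales are computed per token at runtime ("dynamic"), no activation calibration statistics need to be stored; only the weight scales are fixed ("static").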

  • Weight and activation quantization from 16 to 8 bits
  • 50% reduction in disk space and memory requirements
  • Preserves the base model's full 128K-token context length (calibration used 8,192-token sequences)
  • Deployable using vLLM backend for efficient inference
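A minimal vLLM offline-inference example follows. It assumes a CUDA-capable GPU with enough memory for the 8B INT8 checkpoint, and it downloads the model from the Hugging Face Hub on first run; the prompt and sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint automatically.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain INT8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```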

Core Capabilities

  • Multilingual support with strong performance across 7 languages
  • Exceeds original model performance on Arena-Hard (105.4% recovery)
  • Strong coding capabilities with 67.1% pass@1 on HumanEval
  • Excellent performance on mathematical reasoning tasks (84.8% on GSM-8K)
  • Competitive results on multiple-choice tasks (81.7% on ARC Challenge)
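"Recovery" in the list above is simply the quantized model's benchmark score expressed as a percentage of the unquantized baseline's score, so values above 100% mean the quantized model scored higher. A sketch (the scores passed in are hypothetical, not the card's exact numbers):

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Percent of the baseline benchmark score retained after quantization."""
    return 100.0 * quantized_score / baseline_score

# Hypothetical example: quantized model slightly outscores the baseline.
print(round(recovery(27.4, 26.0), 1))  # above 100 -> quantization helped here
```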

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for achieving quantization without performance degradation, actually improving scores on some benchmarks while significantly reducing computational requirements. It's particularly notable for maintaining high performance across multiple languages and tasks.

Q: What are the recommended use cases?

The model is designed for commercial and research applications requiring assistant-like chat capabilities. It excels in multilingual contexts, mathematical reasoning, and coding tasks, making it suitable for a wide range of applications where computational efficiency is crucial.
