Meta-Llama-3.1-8B-Instruct-quantized.w8a8

Maintained By
neuralmagic

Property        Value
Model Size      8B parameters
License         Llama 3.1
Release Date    July 11, 2024
Developer       Neural Magic
Model URL       neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8

What is Meta-Llama-3.1-8B-Instruct-quantized.w8a8?

This is an optimized version of Meta's Llama 3.1 8B instruction-tuned model, featuring INT8 quantization of both weights and activations. Quantization cuts GPU memory usage by roughly 50% and approximately doubles matrix-multiplication throughput, while the model matches or even exceeds the original's performance across various benchmarks.

Implementation Details

The model employs sophisticated quantization techniques, using the GPTQ algorithm with a 1% damping factor and processing 256 sequences of 8,192 random tokens. It implements symmetric static per-channel quantization for weights and symmetric dynamic per-token quantization for activations, specifically targeting linear operators within transformer blocks.
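The two schemes described above can be illustrated with a small numpy sketch. This is not Neural Magic's implementation (which uses the GPTQ algorithm with calibration data); it only shows the arithmetic of symmetric per-channel INT8 weight quantization and symmetric per-token INT8 activation quantization, and how an INT8 matmul is rescaled back to float:

```python
import numpy as np

def quantize_weights_per_channel(w: np.ndarray):
    """Symmetric static per-channel INT8: one scale per output channel (row)."""
    # Map the largest absolute value in each row to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Symmetric dynamic per-token INT8: one scale per token (row),
    computed on the fly at inference time rather than ahead of time."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy linear layer: quantize, multiply in INT8, dequantize the result.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # (out_features, in_features)
x = rng.normal(size=(2, 8)).astype(np.float32)   # (tokens, in_features)

qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

# Accumulate the INT8 matmul in int32, then rescale to float.
y_quant = (qx.astype(np.int32) @ qw.T.astype(np.int32)) * (sx * sw.T)
y_ref = x @ w.T
print(np.max(np.abs(y_quant - y_ref)))  # small quantization error
```

Because activation scales are computed per token at runtime ("dynamic"), no activation calibration statistics need to be stored; only the weight scales are fixed ("static").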

  • Weight and activation quantization from 16 to 8 bits
  • 50% reduction in disk space and memory requirements
  • Preserves the base model's full 128K-token context length (calibration used 8,192-token sequences)
  • Deployable using vLLM backend for efficient inference
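A minimal vLLM offline-inference example follows. It assumes a CUDA-capable GPU with enough memory for the 8B INT8 checkpoint, and it downloads the model from the Hugging Face Hub on first run; the prompt and sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint automatically.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain INT8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```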

Core Capabilities

  • Multilingual support with strong performance across 7 languages
  • Exceeds original model performance on Arena-Hard (105.4% recovery)
  • Strong coding capabilities with 67.1% pass@1 on HumanEval
  • Excellent performance on mathematical reasoning tasks (84.8% on GSM-8K)
  • Competitive results on multiple-choice tasks (81.7% on ARC Challenge)
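"Recovery" in the list above is simply the quantized model's benchmark score expressed as a percentage of the unquantized baseline's score, so values above 100% mean the quantized model scored higher. A sketch (the scores passed in are hypothetical, not the card's exact numbers):

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Percent of the baseline benchmark score retained after quantization."""
    return 100.0 * quantized_score / baseline_score

# Hypothetical example: quantized model slightly outscores the baseline.
print(round(recovery(27.4, 26.0), 1))  # above 100 -> quantization helped here
```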

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for achieving quantization without performance degradation, actually improving scores on some benchmarks while significantly reducing computational requirements. It's particularly notable for maintaining high performance across multiple languages and tasks.

Q: What are the recommended use cases?

The model is designed for commercial and research applications requiring assistant-like chat capabilities. It excels in multilingual contexts, mathematical reasoning, and coding tasks, making it suitable for a wide range of applications where computational efficiency is crucial.
