# Meta-Llama-3-8B-Instruct-quantized.w8a8
| Property | Value |
|---|---|
| Author | Neural Magic |
| License | Llama3 |
| Release Date | 7/11/2024 |
| Model Type | Quantized Language Model |
| Base Architecture | Meta-Llama-3 |
## What is Meta-Llama-3-8B-Instruct-quantized.w8a8?
This is a quantized version of Meta-Llama-3-8B-Instruct, optimized for efficient deployment while preserving the original model's performance. The model quantizes both weights and activations to INT8, which reduces GPU memory requirements by approximately 50% and roughly doubles matrix-multiply compute throughput.
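As a rough illustration of the memory claim (an estimate only; it uses the nominal 8B parameter count and ignores activations, KV cache, and runtime overhead), halving the bytes per weight halves the weight footprint:

```python
# Back-of-the-envelope weight-memory estimate for an ~8B-parameter model.
# Illustrative only: ignores activations, KV cache, and runtime overhead.
params = 8_000_000_000  # nominal parameter count

fp16_gib = params * 2 / 1024**3  # 2 bytes per weight in FP16/BF16
int8_gib = params * 1 / 1024**3  # 1 byte per weight in INT8

print(f"FP16 weights: ~{fp16_gib:.1f} GiB")  # ~14.9 GiB
print(f"INT8 weights: ~{int8_gib:.1f} GiB")  # ~7.5 GiB
```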
## Implementation Details
The model is quantized with the GPTQ algorithm, using a 1% damping factor and 256 calibration sequences of 8,192 random tokens each. Quantization is applied only to the linear operators within transformer blocks, using a symmetric static per-channel scheme for weights and a symmetric dynamic per-token scheme for activations; a toy sketch of these two schemes follows the list below.
- Weight quantization: INT8 data type with per-channel scaling
- Activation quantization: INT8 with dynamic per-token scaling
- 50% reduction in disk size requirements
- Maintains a 68.66 average score on the OpenLLM benchmark (v1)
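The PyTorch sketch below illustrates what these two schemes compute for a single linear operator. It is a toy illustration with hypothetical helper names, not the GPTQ implementation, which additionally uses calibration data to correct for quantization error:

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric static per-channel INT8: one scale per output channel (row)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # [out_channels, 1]
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_activations_per_token(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric dynamic per-token INT8: one scale per token, computed at runtime."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0  # [tokens, 1]
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

# A quantized linear layer runs an integer matmul, then rescales the result.
w = torch.randn(4096, 4096)  # weights of one linear operator
x = torch.randn(16, 4096)    # activations for 16 tokens
qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

ref = x @ w.T
y = (qx.to(torch.int32) @ qw.to(torch.int32).T).float() * sx * sw.T
print(f"mean relative error: {((y - ref).abs().mean() / ref.abs().mean()):.2%}")
```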
## Core Capabilities
- Efficient deployment with reduced memory footprint
- Assistant-like chat functionality in English
- Compatible with vLLM and Transformers deployment (a vLLM sketch follows this list)
- Benchmark performance matching the original model (99.4-101.4% recovery across tasks)
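As an example of vLLM deployment, the sketch below assumes the checkpoint is published under the Hugging Face ID `neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8` (a repository name inferred from the model title and author, not confirmed by this card):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed Hugging Face repository ID (inferred from the model title and author).
model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Format the request with the Llama 3 chat template before generating.
messages = [{"role": "user", "content": "Explain INT8 quantization in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```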
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for achieving significant efficiency gains through quantization while maintaining performance metrics nearly identical to the original model's. Its approach to optimization is balanced: it reduces memory and compute requirements without compromising output quality.
**Q: What are the recommended use cases?**
The model is designed for commercial and research applications in English-language tasks, and is particularly suited to assistant-like chat interactions. It's ideal for deployments where resource efficiency is crucial and high-quality language model capabilities must be maintained.
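As a final illustration, a minimal assistant-style chat through Transformers might look like the following sketch (same assumed repository ID as above; loading this INT8 checkpoint in Transformers is assumed to work through its compressed-tensors support):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository ID (inferred from the model title and author).
model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Draft a polite meeting reminder."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```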