Meta-Llama-3-8B-Instruct-quantized.w8a8

Maintained By
neuralmagic

Property            Value
Author              Neural Magic
License             Llama 3
Release Date        7/11/2024
Model Type          Quantized Language Model
Base Architecture   Meta-Llama-3

What is Meta-Llama-3-8B-Instruct-quantized.w8a8?

This is a quantized version of Meta-Llama-3-8B-Instruct, specifically optimized for efficient deployment while maintaining performance. The model employs INT8 quantization for both weights and activations, resulting in approximately 50% reduced GPU memory requirements and 2x increased matrix-multiply compute throughput.
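The 50% memory figure follows directly from the data types involved: FP16/BF16 stores each weight in 2 bytes, while INT8 stores it in 1. A back-of-envelope check (the parameter count below is approximate, not an official figure):

```python
params = 8_000_000_000          # ~8B parameters (approximate)
fp16_gb = params * 2 / 1e9      # 2 bytes per weight in FP16/BF16
int8_gb = params * 1 / 1e9      # 1 byte per weight in INT8
print(fp16_gb, int8_gb)         # ~16 GB vs ~8 GB of weight storage
```

Actual GPU memory use is somewhat higher in practice, since activations, the KV cache, and runtime buffers are not covered by weight quantization.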

Implementation Details

The model utilizes advanced quantization techniques, including the GPTQ algorithm with a 1% damping factor, calibrated on 256 sequences of 8,192 random tokens. Quantization is applied specifically to the linear operators within transformer blocks, using a symmetric static per-channel scheme for weights and a symmetric dynamic per-token scheme for activations.

  • Weight quantization: INT8 data type with per-channel scaling
  • Activation quantization: INT8 with dynamic per-token scaling
  • 50% reduction in disk size requirements
  • Maintains 68.66 average score on OpenLLM benchmark (v1)
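To make the two schemes above concrete, here is a minimal NumPy toy of symmetric per-channel weight quantization and dynamic per-token activation quantization. This is an illustrative sketch only, not Neural Magic's actual kernels; all function names are hypothetical:

```python
import numpy as np

def quantize_weights_per_channel(w):
    # Symmetric static per-channel INT8: one scale per output channel (row),
    # computed once offline from the weight matrix
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def quantize_activations_per_token(x):
    # Symmetric dynamic per-token INT8: one scale per token (row),
    # computed at runtime from the incoming activations
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

# Toy linear layer: y = x @ w.T, computed in INT8 then dequantized
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))   # [out_features, in_features]
x = rng.normal(size=(2, 8))   # [tokens, in_features]

qw, w_scales = quantize_weights_per_channel(w)
qx, x_scales = quantize_activations_per_token(x)

# INT8 matmul accumulated in int32, then rescaled back to float
y_quant = (qx.astype(np.int32) @ qw.T.astype(np.int32)) * x_scales * w_scales.T
y_ref = x @ w.T
print(np.max(np.abs(y_quant - y_ref)))  # small quantization error
```

The static/dynamic split mirrors the card's description: weight scales can be fixed ahead of time because weights never change, while activation scales must be recomputed per token because activation ranges vary with the input.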

Core Capabilities

  • Efficient deployment with reduced memory footprint
  • Assistant-like chat functionality in English
  • Compatible with vLLM and Transformers deployment
  • Benchmark performance matching the original model (99.4-101.4% recovery across tasks)
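Given the vLLM compatibility noted above, deployment might look like the following sketch. The sampling parameters are illustrative choices, and the exact API surface can vary by vLLM version, so treat this as a starting point rather than a definitive recipe:

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint from the Hugging Face Hub
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8")

# Illustrative sampling settings
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is INT8 quantization?"], params)
print(outputs[0].outputs[0].text)
```

Because both weights and activations are INT8 (W8A8), vLLM can use INT8 matmul kernels end to end, which is where the claimed ~2x compute throughput comes from.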

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for achieving significant efficiency gains through quantization while maintaining performance metrics nearly identical to the original model. It's particularly notable for its balanced approach to optimization, reducing resource requirements without compromising functionality.

Q: What are the recommended use cases?

The model is specifically designed for commercial and research applications in English language tasks, particularly suited for assistant-like chat interactions. It's ideal for deployments where resource efficiency is crucial while maintaining high-quality language model capabilities.
