Meta-Llama-3-8B-Instruct-FP8

neuralmagic

An 8B-parameter Llama 3 Instruct model quantized to FP8, recovering 99.28% of the original model's accuracy while roughly halving memory requirements.

Property	Value
Parameter Count	8.03B
Model Type	Instruction-tuned Language Model
License	Llama3
Quantization	FP8
Language	English

What is Meta-Llama-3-8B-Instruct-FP8?

Meta-Llama-3-8B-Instruct-FP8 is a quantized version of Meta-Llama-3-8B-Instruct, built for efficient deployment while staying close to the original model's performance. FP8 quantization reduces the model's disk size and GPU memory footprint by approximately 50% while preserving 99.28% of the original model's accuracy.
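As a rough back-of-the-envelope check on the 50% figure, the sketch below estimates weight storage from the 8.03B parameter count listed in the table above. It assumes the original checkpoint stores 16-bit weights (2 bytes per parameter) and that every parameter moves to 1 byte under FP8. In practice some layers may stay in higher precision, so the real saving is only approximately half. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 8.03e9  # parameter count from the property table

# Assumption: 2 bytes per parameter for the original, 1 byte for FP8.
original_gb = weight_memory_gb(PARAMS, 2)
fp8_gb = weight_memory_gb(PARAMS, 1)
reduction = 1 - fp8_gb / original_gb

print(f"original: {original_gb:.2f} GB, FP8: {fp8_gb:.2f} GB, saved: {reduction:.0%}")
```

This counts weights only. Activations, the KV cache, and runtime overhead add to the memory actually needed at inference time.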

Implementation Details

The model applies symmetric per-tensor quantization to the linear operators within transformer blocks, using AutoFP8 with calibration samples from UltraChat. It is intended for deployment with vLLM >= 0.5.0 and achieves an average score of 68.22 on the OpenLLM benchmark.
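The recovery figure can be related to the benchmark average with simple arithmetic. The sketch below assumes that recovery is defined as the quantized score divided by the unquantized score. That is a common convention but is not stated explicitly here, so the baseline it derives is an implied value, not a reported one.

```python
def accuracy_recovery(quantized_score: float, baseline_score: float) -> float:
    """Recovery as a percentage, assuming recovery = quantized / baseline."""
    return 100.0 * quantized_score / baseline_score


QUANTIZED_AVG = 68.22  # average OpenLLM score quoted for the FP8 model
RECOVERY_PCT = 99.28   # recovery quoted for the FP8 model

# Invert the assumed definition to get the baseline average it implies.
implied_baseline = QUANTIZED_AVG / (RECOVERY_PCT / 100.0)

print(f"implied baseline average: {implied_baseline:.2f}")
print(f"recovery check: {accuracy_recovery(QUANTIZED_AVG, implied_baseline):.2f}%")
```

Under that assumption the quantized model loses about half a point on the benchmark average.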

Weight and activation quantization using FP8 data type
50% reduction in disk size and GPU memory requirements
Optimized for vLLM deployment
Calibrated using 512 sequences of UltraChat
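
To make "symmetric per-tensor quantization" concrete, here is a minimal pure-Python sketch. It is not the AutoFP8 implementation. It assumes the E4M3 variant of FP8 (3 mantissa bits, largest finite value 448) and derives a single scale per tensor from the largest absolute value. The helper names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), FP8_E4M3_MAX)
    # frexp gives magnitude = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    exponent = max(math.frexp(magnitude)[1] - 1, -6)  # below 2**-6 is subnormal
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(magnitude / step) * step, x)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale, zero point fixed at 0."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8-grid values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 2.0, 4.0]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
```

In a real pipeline the activation scale would come from calibration data (here, the UltraChat sequences), while weight scales can be computed directly from the weights.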

Core Capabilities

Assistant-like chat functionality
Maintains high performance across various benchmarks (MMLU, ARC Challenge, GSM-8K)
Efficient inference with reduced resource requirements
English language processing and generation

Frequently Asked Questions

Q: What makes this model unique?

This model uses FP8 quantization of both weights and activations to cut disk size and GPU memory requirements roughly in half while retaining 99.28% of the original model's performance. It is intended for production deployment with vLLM.

Q: What are the recommended use cases?

The model is best suited for commercial and research applications requiring English language processing, particularly assistant-like chat scenarios where resource efficiency matters. It targets the same kinds of tasks as the original model while using about half the memory.
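
For the assistant-style use case described above, a minimal vLLM sketch might look like the following. The repository ID is an assumption composed from the publisher and model name at the top of this page. Running it requires vLLM >= 0.5.0 and a supported GPU. The function name and default sampling values are illustrative, not recommended settings.

```python
# Assumed Hugging Face repository ID (publisher + model name).
MODEL_ID = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"


def generate_replies(prompts, max_tokens=128, temperature=0.7):
    """Return one completion per prompt using vLLM's offline inference API."""
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each result holds a list of candidates; take the first one.
    return [out.outputs[0].text for out in outputs]


# Example call (downloads the model and needs a GPU):
# replies = generate_replies(["Explain FP8 quantization in one sentence."])
```

Prompts are passed to the model as written. For multi-turn chat, the Llama 3 chat template would normally be applied to the messages first.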