Llama-3.1-8B-Instruct-FP8

nvidia

NVIDIA's FP8-quantized version of Meta's Llama 3.1 8B Instruct model, offering a 1.3x inference speedup on H100 GPUs while maintaining strong performance across benchmarks.

  • Model Size: 8B parameters
  • License: NVIDIA Open Model License
  • Supported Hardware: NVIDIA Blackwell, Hopper, Lovelace
  • Quantization: FP8
  • Model URL: huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8

What is Llama-3.1-8B-Instruct-FP8?

The NVIDIA Llama-3.1-8B-Instruct-FP8 is a quantized version of Meta's Llama 3.1 8B Instruct model, optimized for efficient inference while maintaining impressive performance. This model represents a significant advancement in model optimization, reducing both disk space and GPU memory requirements by approximately 50% through FP8 quantization.
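The ~50% figure follows directly from the storage cost per parameter. A back-of-the-envelope sketch (illustrative only; real footprints also include activations, KV cache, and runtime overhead):

```python
# Rough weight-memory comparison for an 8B-parameter model.
params = 8.0e9       # ~8 billion parameters
bf16_bytes = 2       # BF16 stores 2 bytes per parameter
fp8_bytes = 1        # FP8 stores 1 byte per parameter

bf16_gb = params * bf16_bytes / 1e9   # weight memory in BF16, in GB
fp8_gb = params * fp8_bytes / 1e9     # weight memory in FP8, in GB
savings = 1 - fp8_gb / bf16_gb        # fractional reduction

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, saved {savings:.0%}")
# → BF16: 16 GB, FP8: 8 GB, saved 50%
```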

Implementation Details

The model quantizes the weights and activations of the linear operators within transformer blocks, achieving a 1.3x speedup on H100 GPUs over the original BF16 version. It can be deployed with either the TensorRT-LLM or vLLM runtime engine and supports context lengths up to 128K tokens.
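For the vLLM path, loading the checkpoint by its Hugging Face model ID is typically all that is needed, since the quantization config ships with the model. A minimal sketch, assuming vLLM is installed and an FP8-capable GPU (Hopper, Ada Lovelace, or Blackwell) is available; the prompt and sampling settings are placeholders:

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint directly from the Hugging Face Hub.
# max_model_len is capped here to keep KV-cache memory modest; the
# model itself supports context lengths up to 128K tokens.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", max_model_len=8192)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

This is a sketch, not a verified deployment recipe; consult the model card on Hugging Face for the exact TensorRT-LLM or vLLM invocation.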

  • Calibrated using CNN/DailyMail dataset
  • Evaluated on MMLU, GSM8K, ARC Challenge, and IFEVAL benchmarks
  • Maintains strong performance metrics (68.7% on MMLU, 83.1% on GSM8K)
  • Achieves 11,062.90 TPS compared to original's 8,579.93 TPS
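The quoted 1.3x speedup can be sanity-checked from the throughput numbers above:

```python
# Throughput figures from the model card (tokens per second).
fp8_tps = 11062.90    # FP8-quantized model
bf16_tps = 8579.93    # original BF16 model

speedup = fp8_tps / bf16_tps
print(f"Speedup: {speedup:.2f}x")   # ≈ 1.29x, rounded to ~1.3x in the card
```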

Core Capabilities

  • Efficient inference with reduced memory footprint
  • High-performance text generation and instruction following
  • Seamless integration with TensorRT-LLM and vLLM
  • Support for commercial and non-commercial applications

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized FP8 quantization, which significantly reduces resource requirements while maintaining performance within 1-2% of the original model across key benchmarks.

Q: What are the recommended use cases?

The model is ideal for production environments where efficiency is crucial, particularly in applications requiring high-throughput text generation and instruction following. It's especially suitable for deployment on NVIDIA's latest GPU architectures.
