Llama-3.1-8B-Instruct-FP8

nvidia

NVIDIA's FP8-quantized version of Meta's Llama 3.1 8B Instruct model, offering a 1.3x inference speedup on H100 GPUs while maintaining strong performance across benchmarks.

  • Model Size: 8B parameters
  • License: NVIDIA Open Model License
  • Supported Hardware: NVIDIA Blackwell, Hopper, Lovelace
  • Quantization: FP8
  • Model URL: huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8

What is Llama-3.1-8B-Instruct-FP8?

The NVIDIA Llama-3.1-8B-Instruct-FP8 is a quantized version of Meta's Llama 3.1 8B Instruct model, optimized for efficient inference while maintaining impressive performance. This model represents a significant advancement in model optimization, reducing both disk space and GPU memory requirements by approximately 50% through FP8 quantization.
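The ~50% figure follows directly from the storage cost per parameter. A back-of-the-envelope sketch (illustrative only; real footprints also include activations, KV cache, and runtime overhead):

```python
# Rough weight-memory comparison for an 8B-parameter model.
params = 8.0e9       # ~8 billion parameters
bf16_bytes = 2       # BF16 stores 2 bytes per parameter
fp8_bytes = 1        # FP8 stores 1 byte per parameter

bf16_gb = params * bf16_bytes / 1e9   # weight memory in BF16, in GB
fp8_gb = params * fp8_bytes / 1e9     # weight memory in FP8, in GB
savings = 1 - fp8_gb / bf16_gb        # fractional reduction

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, saved {savings:.0%}")
# → BF16: 16 GB, FP8: 8 GB, saved 50%
```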

Implementation Details

The model quantizes the weights and activations of the linear operators within transformer blocks, achieving a 1.3x speedup on H100 GPUs over the original BF16 version. It can be deployed with either the TensorRT-LLM or vLLM runtime engine and supports context lengths up to 128K tokens.
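For the vLLM path, loading the checkpoint by its Hugging Face model ID is typically all that is needed, since the quantization config ships with the model. A minimal sketch, assuming vLLM is installed and an FP8-capable GPU (Hopper, Ada Lovelace, or Blackwell) is available; the prompt and sampling settings are placeholders:

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint directly from the Hugging Face Hub.
# max_model_len is capped here to keep KV-cache memory modest; the
# model itself supports context lengths up to 128K tokens.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", max_model_len=8192)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

This is a sketch, not a verified deployment recipe; consult the model card on Hugging Face for the exact TensorRT-LLM or vLLM invocation.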

  • Calibrated using CNN/DailyMail dataset
  • Evaluated on MMLU, GSM8K, ARC Challenge, and IFEVAL benchmarks
  • Maintains strong performance metrics (68.7% on MMLU, 83.1% on GSM8K)
  • Achieves 11,062.90 TPS compared to original's 8,579.93 TPS
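The quoted 1.3x speedup can be sanity-checked from the throughput numbers above:

```python
# Throughput figures from the model card (tokens per second).
fp8_tps = 11062.90    # FP8-quantized model
bf16_tps = 8579.93    # original BF16 model

speedup = fp8_tps / bf16_tps
print(f"Speedup: {speedup:.2f}x")   # ≈ 1.29x, rounded to ~1.3x in the card
```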

Core Capabilities

  • Efficient inference with reduced memory footprint
  • High-performance text generation and instruction following
  • Seamless integration with TensorRT-LLM and vLLM
  • Support for commercial and non-commercial applications

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized FP8 quantization, which significantly reduces resource requirements while maintaining performance within 1-2% of the original model across key benchmarks.

Q: What are the recommended use cases?

The model is ideal for production environments where efficiency is crucial, particularly in applications requiring high-throughput text generation and instruction following. It's especially suitable for deployment on NVIDIA's latest GPU architectures.
