Meta-Llama-3.1-8B-Instruct-FP8

neuralmagic

An 8B-parameter, FP8-quantized Llama 3.1 instruction-tuned model optimized for efficient inference, supporting 8 languages while retaining 99.52% of the full-precision model's benchmark performance

  • Parameter Count: 8.03B
  • Model Type: Instruction-tuned LLM
  • Supported Languages: 8 (en, de, fr, it, pt, hi, es, th)
  • License: llama3.1
  • Quantization: FP8 (weights and activations)

What is Meta-Llama-3.1-8B-Instruct-FP8?

Meta-Llama-3.1-8B-Instruct-FP8 is an optimized version of Meta's Llama 3.1 8B Instruct model, designed for efficient deployment while maintaining nearly identical performance to its full-precision counterpart. Through FP8 quantization, it achieves a 50% reduction in disk size and GPU memory requirements while retaining 99.52% of the original model's performance.
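The 50% figure follows directly from the storage width of the weights: BF16 uses 2 bytes per parameter, FP8 uses 1. A back-of-envelope sketch (weights only, ignoring activations and KV cache):

```python
# Back-of-envelope weight footprint for an 8.03B-parameter model.
# BF16 stores 2 bytes per parameter; FP8 stores 1 byte, which is
# where the ~50% disk and GPU-memory reduction comes from.
# (Runtime memory also includes activations and the KV cache,
# which this sketch ignores.)
PARAMS = 8.03e9  # parameter count from the model card

bf16_gb = PARAMS * 2 / 1024**3  # ~15.0 GiB of weights in BF16
fp8_gb = PARAMS * 1 / 1024**3   # ~7.5 GiB of weights in FP8

print(f"BF16 weights: {bf16_gb:.1f} GiB")
print(f"FP8 weights:  {fp8_gb:.1f} GiB")
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")  # prints 50%
```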

Implementation Details

The model applies symmetric per-tensor quantization to both the weights and activations of linear operators within transformer blocks. It is optimized for deployment with vLLM and was calibrated on 512 sequences from the UltraChat dataset.

  • Achieves a 73.44 average score on the OpenLLM benchmark (vs. 73.79 for the original)
  • Optimized for commercial and research applications
  • Compatible with vLLM for efficient inference
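"Symmetric per-tensor" means one shared scale per tensor and no zero point. The sketch below illustrates the idea in NumPy; it is a simplified model, not the production kernel: real FP8 E4M3 hardware also handles subnormals and special values, which are ignored here.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def round_to_e4m3_grid(x: np.ndarray) -> np.ndarray:
    # E4M3 keeps 3 mantissa bits, so normal values are spaced at
    # multiples of 2**-4 of the enclosing power of two. Subnormals
    # and special values are ignored in this sketch.
    m, e = np.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_per_tensor(x: np.ndarray):
    # One shared scale for the whole tensor; symmetric, so no zero point.
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = round_to_e4m3_grid(np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q, scale

w = np.random.default_rng(0).normal(size=(8, 8)).astype(np.float32)
q, scale = quantize_per_tensor(w)
w_hat = q * scale                 # dequantize for comparison

# With 3 mantissa bits the relative rounding error is bounded by 1/16.
rel_err = np.abs(w - w_hat) / np.abs(w)
print(f"scale = {scale:.3e}, max relative error = {rel_err.max():.3%}")
```

Per-tensor scaling is the cheapest granularity (one multiply per tensor at dequant time) at the cost of letting a single outlier value stretch the scale for the whole tensor.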

Core Capabilities

  • Multi-lingual support across 8 languages
  • Assistant-style chat functionality
  • Strong performance on key benchmarks (MMLU: 67.97%, ARC Challenge: 81.66%, GSM-8K: 81.12%)
  • 50% reduced resource requirements compared to original model

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its efficient FP8 quantization, which dramatically reduces resource requirements while maintaining over 99.5% of the original model's average benchmark performance. It is particularly notable for sustaining that performance across multiple languages and complex reasoning tasks.

Q: What are the recommended use cases?

The model is ideal for commercial and research applications requiring efficient deployment of large language models, particularly in multi-lingual contexts. It's specifically designed for assistant-like chat applications where resource optimization is crucial but performance cannot be compromised.
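For such deployments, vLLM can serve the checkpoint through its OpenAI-compatible server. A minimal sketch, assuming the model id as published on Hugging Face and a recent vLLM release (flag names can vary between versions):

```shell
# Serve the FP8 checkpoint with vLLM's OpenAI-compatible server.
# FP8 speedups require a GPU with hardware FP8 support.
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --max-model-len 4096
```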
