Meta-Llama-3-70B-Instruct-FP8

neuralmagic

Meta's Llama-3 70B Instruct model quantized to FP8, roughly halving the memory footprint while retaining 99.55% of the original model's benchmark performance. Suited to commercial and research applications.

Property	Value
Parameter Count	70.6B
Model Type	Language Model (Instruct)
License	Llama3
Quantization	FP8
OpenLLM Score	79.16

What is Meta-Llama-3-70B-Instruct-FP8?

Meta-Llama-3-70B-Instruct-FP8 is a quantized version of Meta's Llama-3 70B Instruct model, built for efficient deployment with near-original accuracy. Both weights and activations are quantized to FP8, which reduces the model's memory footprint by approximately 50% compared to the original 16-bit version.
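The roughly 50% figure follows from halving the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate only: it assumes every one of the 70.6B parameters is stored at the stated width, and it ignores the KV cache, activations, and any layers that may be left unquantized. The helper name `weight_gib` is illustrative.

```python
# Rough weight-memory estimate (assumption: all parameters stored at the given width).
PARAMS = 70.6e9  # parameter count from the property table


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Return weight storage in GiB for a given per-parameter width."""
    return num_params * bytes_per_param / 2**30


bf16_gib = weight_gib(PARAMS, 2)  # 16-bit baseline: 2 bytes per parameter
fp8_gib = weight_gib(PARAMS, 1)   # FP8: 1 byte per parameter
reduction = 1 - fp8_gib / bf16_gib

print(f"16-bit: {bf16_gib:.1f} GiB, FP8: {fp8_gib:.1f} GiB, reduction: {reduction:.0%}")
```

In practice the saving may be slightly below 50% if some tensors stay in 16-bit, which is why the figure is quoted as approximate.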

Implementation Details

The model was quantized with AutoFP8, applied to the linear operators within transformer blocks. The quantized model retains 99.55% of the original model's performance on benchmark tasks.

Weight and activation quantization using FP8 data type
Symmetric per-tensor quantization implementation
Compatible with vLLM >= 0.5.0 for inference
Calibrated using 512 sequences from UltraChat
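Symmetric per-tensor quantization means a single scale factor per tensor and no zero point. The following is a minimal pure-Python sketch of that idea, not AutoFP8's actual implementation. It assumes the E4M3 variant of FP8 (maximum finite value 448, 3 mantissa bits), which the listing above does not specify, and the function names are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Clamp to -6, the smallest normal exponent; below that, spacing is fixed.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(q, scale):
    """Map quantized values back to the original range."""
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.13, 0.9, -0.004]  # toy tensor for illustration
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
```

Here the scale is taken from the tensor's own maximum. For activations, a scale would typically be estimated ahead of time from calibration data such as the UltraChat sequences mentioned above.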

Core Capabilities

80.06% on MMLU (5-shot)
Strong reasoning capabilities with 91.12% on GSM-8K
Excellent performance on HellaSwag (85.41%) and Winogrande (83.03%)
Optimized for English language tasks
Suitable for commercial and research applications

Frequently Asked Questions

Q: What makes this model unique?

This model pairs FP8 quantization, which lowers resource requirements, with 99.55% retention of the original model's benchmark performance. It is built for deployment with vLLM (version 0.5.0 or later), which suits it to production environments.
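To put the 99.55% figure in absolute terms, the short calculation below derives the baseline score it implies. It assumes the recovery percentage is measured on the same OpenLLM average as the 79.16 score in the property table. The baseline value is derived here, not quoted from the listing.

```python
quantized_avg = 79.16  # OpenLLM score from the property table
recovery = 0.9955      # stated fraction of original performance retained

# Baseline implied by the two figures above (an inference, not a reported number).
implied_baseline = quantized_avg / recovery
points_lost = implied_baseline - quantized_avg

print(f"implied baseline: {implied_baseline:.2f}, drop: {points_lost:.2f} points")
```

Under that assumption, quantization costs well under half a point on the averaged benchmark score.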

Q: What are the recommended use cases?

The model is best suited for English language tasks, particularly commercial and research applications that need assistant-like chat capabilities. It fits deployments where memory and compute are constrained but accuracy must stay close to that of the original model.