Llama-3.2-1B-Instruct-FP8-dynamic
| Property | Value |
|---|---|
| Model Base | Meta-Llama-3.2 |
| Release Date | 9/25/2024 |
| License | llama3.2 |
| Developer | Neural Magic |
| Hugging Face | Model Repository |
What is Llama-3.2-1B-Instruct-FP8-dynamic?
Llama-3.2-1B-Instruct-FP8-dynamic is an optimized version of the Llama-3.2-1B-Instruct model that applies FP8 quantization to both weights and activations. This optimization roughly halves the model's memory footprint while recovering about 99.7% of the original model's accuracy across the evaluated benchmarks.
Implementation Details
The model converts the original 16-bit weights and activations to 8-bit floating-point (FP8) representations. Key technical aspects include:
- Symmetric per-channel quantization for weights
- Dynamic per-token quantization for activations (both schemes are sketched below)
- 50% reduction in disk size and GPU memory requirements
- Quantization applied only to linear operators within transformer blocks
- Compatible with the vLLM backend for efficient deployment (see the serving sketch at the end of this page)
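To make the two schemes concrete, here is a minimal sketch using PyTorch's native `float8_e4m3fn` dtype. The function names, tensor shapes, and the scale-clamping detail are illustrative assumptions, not the model's actual quantization code, which is produced with dedicated compression tooling.

```python
import torch

# Largest representable magnitude in FP8 E4M3 (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric per-channel quantization: one scale per output channel,
    # computed once, offline, from the channel's absolute maximum.
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic per-token quantization: one scale per token, computed on
    # the fly at inference time from that token's absolute maximum.
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

# Example: quantize a toy linear layer's weight and a batch of activations.
w_fp8, w_scale = quantize_weight_per_channel(torch.randn(2048, 2048))
x_fp8, x_scale = quantize_activation_per_token(torch.randn(4, 16, 2048))
```

Because the weight scales are fixed ahead of time while the activation scales are recomputed per token, the scheme is "dynamic" on the activation side only, which is what removes the need for a calibration dataset.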
Core Capabilities
- Scores 47.55% on MMLU (5-shot), closely matching the original model
- Achieves 57.25% on ARC Challenge (0-shot)
- Maintains 45.94% accuracy on GSM-8K-cot (8-shot)
- Performs strongly on additional benchmarks, including Winogrande, HellaSwag, and TruthfulQA (a reproduction sketch follows this list)
- Optimized for assistant-like chat applications
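Scores like these can in principle be reproduced with lm-evaluation-harness. The sketch below is a hypothetical invocation: it assumes the `lm_eval` package is installed with vLLM support and that the checkpoint lives at the Hugging Face repository id shown; the published numbers may also use different harness versions or prompt formats.

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (lm_eval).
# The repository id and task settings below are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",
    tasks=["arc_challenge"],
    num_fewshot=0,
)
print(results["results"]["arc_challenge"])
```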
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its efficient FP8 quantization scheme, which maintains high performance while significantly reducing resource requirements. It achieves this through per-token dynamic quantization for activations and per-channel symmetric quantization for weights.
Q: What are the recommended use cases?
The model is designed for commercial and research applications in English, and is particularly suited to assistant-like chat interactions. It is a good fit for deployments where resource efficiency is crucial and performance cannot be compromised, as in the serving sketch below.
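The following minimal vLLM serving sketch illustrates such a deployment. The repository id is assumed to match the Hugging Face repository linked above, and the prompt and sampling settings are illustrative choices, not recommendations from this page.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Repository id assumed from the model name above.
model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]

# Render the chat messages with the model's chat template, then generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id)
outputs = llm.generate(prompt, SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```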