Llama-3.2-1B-Instruct-FP8-dynamic
| Property | Value |
|---|---|
| Model Base | Meta-Llama-3.2 |
| Release Date | 9/25/2024 |
| License | llama3.2 |
| Developer | Neural Magic |
| Hugging Face | Model Repository |
What is Llama-3.2-1B-Instruct-FP8-dynamic?
Llama-3.2-1B-Instruct-FP8-dynamic is an optimized version of the Llama-3.2-1B-Instruct model that applies FP8 quantization to both weights and activations. This optimization roughly halves the model's memory footprint while recovering about 99.7% of the original model's accuracy across the evaluated benchmarks.
Implementation Details
The model converts the original 16-bit weights and activations to 8-bit floating-point (FP8) representations. Key technical aspects include:
- Symmetric per-channel quantization for weights
- Dynamic per-token quantization for activations (both schemes are sketched below)
- 50% reduction in disk size and GPU memory requirements
- Quantization applied only to linear operators within transformer blocks
- Compatible with the vLLM backend for efficient deployment (see the serving sketch at the end of this page)
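To make the two schemes concrete, here is a minimal sketch using PyTorch's native `float8_e4m3fn` dtype. The function names, tensor shapes, and the scale-clamping detail are illustrative assumptions, not the model's actual quantization code, which is produced with dedicated compression tooling.

```python
import torch

# Largest representable magnitude in FP8 E4M3 (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric per-channel quantization: one scale per output channel,
    # computed once, offline, from the channel's absolute maximum.
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic per-token quantization: one scale per token, computed on
    # the fly at inference time from that token's absolute maximum.
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

# Example: quantize a toy linear layer's weight and a batch of activations.
w_fp8, w_scale = quantize_weight_per_channel(torch.randn(2048, 2048))
x_fp8, x_scale = quantize_activation_per_token(torch.randn(4, 16, 2048))
```

Because the weight scales are fixed ahead of time while the activation scales are recomputed per token, the scheme is "dynamic" on the activation side only, which is what removes the need for a calibration dataset.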
Core Capabilities
- Scores 47.55% on MMLU (5-shot), closely matching the original model
- Achieves 57.25% on ARC Challenge (0-shot)
- Maintains 45.94% accuracy on GSM-8K-cot (8-shot)
- Performs strongly on additional benchmarks, including Winogrande, HellaSwag, and TruthfulQA (a reproduction sketch follows this list)
- Optimized for assistant-like chat applications
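Scores like these can in principle be reproduced with lm-evaluation-harness. The sketch below is a hypothetical invocation: it assumes the `lm_eval` package is installed with vLLM support and that the checkpoint lives at the Hugging Face repository id shown; the published numbers may also use different harness versions or prompt formats.

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (lm_eval).
# The repository id and task settings below are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic",
    tasks=["arc_challenge"],
    num_fewshot=0,
)
print(results["results"]["arc_challenge"])
```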
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its efficient FP8 quantization scheme, which maintains high performance while significantly reducing resource requirements. It achieves this through per-token dynamic quantization for activations and per-channel symmetric quantization for weights.
Q: What are the recommended use cases?
The model is designed for commercial and research applications in English, and is particularly suited to assistant-like chat interactions. It is a good fit for deployments where resource efficiency is crucial and performance cannot be compromised, as in the serving sketch below.
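The following minimal vLLM serving sketch illustrates such a deployment. The repository id is assumed to match the Hugging Face repository linked above, and the prompt and sampling settings are illustrative choices, not recommendations from this page.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Repository id assumed from the model name above.
model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Summarize FP8 quantization in one sentence."}]

# Render the chat messages with the model's chat template, then generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id)
outputs = llm.generate(prompt, SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```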