Llama-3.2-1B-Instruct-FP8-dynamic

Llama-3.2-1B-Instruct-FP8-dynamic

neuralmagic

Optimized Llama-3.2-1B-Instruct model with FP8 quantization, reducing memory footprint by 50% while maintaining 99.7% accuracy

PropertyValue
Model BaseMeta-Llama-3.2
Release Date9/25/2024
Licensellama3.2
DeveloperNeural Magic
Hugging FaceModel Repository

What is Llama-3.2-1B-Instruct-FP8-dynamic?

Llama-3.2-1B-Instruct-FP8-dynamic is an optimized version of the Llama-3.2-1B-Instruct model, featuring FP8 quantization for both weights and activations. This optimization significantly reduces the model's memory footprint while maintaining impressive performance, achieving 99.7% of the original model's accuracy across various benchmarks.

Implementation Details

The model employs sophisticated quantization techniques, converting the original 16-bit parameters to 8-bit representations. Key technical aspects include:

  • Symmetric per-channel quantization for weights
  • Dynamic per-token quantization for activations
  • 50% reduction in disk size and GPU memory requirements
  • Quantization applied only to linear operators within transformer blocks
  • Compatible with vLLM backend for efficient deployment

Core Capabilities

  • Matches original model performance with scores of 47.55% on MMLU (5-shot)
  • Achieves 57.25% on ARC Challenge (0-shot)
  • Maintains 45.94% accuracy on GSM-8K-cot (8-shot)
  • Effective on multiple benchmarks including Winogrande, Hellaswag, and TruthfulQA
  • Optimized for assistant-like chat applications

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its efficient FP8 quantization scheme that maintains high performance while significantly reducing resource requirements. It achieves this through advanced dynamic quantization techniques for activations and symmetric quantization for weights.

Q: What are the recommended use cases?

The model is specifically designed for commercial and research applications in English language processing, particularly suited for assistant-like chat interactions. It's ideal for deployments where resource efficiency is crucial without compromising on performance.

Socials
PromptLayer
Company
All services online
Location IconPromptLayer is located in the heart of New York City
PromptLayer © 2026