Llama-3.2-1B-Instruct-FP8

neuralmagic

Optimized 1B-parameter Llama-3.2 model quantized to FP8, offering roughly 50% memory reduction while maintaining about 99.8% of the original model's accuracy. Supports 8 languages.

  • Parameter Count: 1B parameters
  • Model Type: Instruction-tuned Language Model
  • Architecture: Llama-3
  • License: Llama 3.2
  • Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai

What is Llama-3.2-1B-Instruct-FP8?

Llama-3.2-1B-Instruct-FP8 is an optimized version of the original Llama-3.2-1B-Instruct model, specifically designed to provide efficient performance while maintaining accuracy. This model represents a significant advancement in model compression, utilizing FP8 quantization to reduce both memory requirements and computational demands.

Implementation Details

The model employs sophisticated quantization techniques, converting weights and activations from 16-bit to 8-bit (FP8) precision. This optimization yields roughly a 50% reduction in GPU memory usage and, on hardware with native FP8 support, roughly doubles matrix-multiply throughput. The quantization process uses a symmetric static per-channel scheme for weights and a symmetric per-tensor scheme for activations.

  • Weight quantization reduces memory footprint by 50%
  • Calibrated using 512 sequences from Neural Magic's calibration dataset
  • Maintains performance within 1% of the original model
  • Implements FP8 data type for optimal efficiency
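To make the symmetric per-tensor scheme concrete, here is a minimal, self-contained sketch of FP8 (E4M3-style) quantization in pure Python. It is an illustration of the arithmetic only, not Neural Magic's implementation: the scale is the tensor's max absolute value divided by E4M3's largest finite value (448), and values are rounded to a 3-mantissa-bit grid.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_e4m3(x, scale):
    """Scale a value into FP8 range, then round it to the nearest
    E4M3-representable value (4 exponent bits, 3 mantissa bits)."""
    v = max(-E4M3_MAX, min(E4M3_MAX, x / scale))  # saturate to FP8 range
    if v == 0.0:
        return 0.0
    sign, a = math.copysign(1.0, v), abs(v)
    # Exponent of the value, clamped to E4M3's normal/subnormal range
    e = max(-6, min(8, math.floor(math.log2(a))))
    step = 2.0 ** (e - 3)  # 3 mantissa bits -> 8 steps per binade
    return sign * round(a / step) * step

# Symmetric per-tensor scale: one scale shared by the whole tensor
weights = [0.8, -1.6, 0.05, 2.4]
scale = max(abs(w) for w in weights) / E4M3_MAX
deq = [quantize_e4m3(w, scale) * scale for w in weights]
```

Real FP8 kernels store the quantized values in 8-bit registers and fold the scale into the matmul; the rounding behavior, however, is what this sketch reproduces.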

Core Capabilities

  • Multi-lingual support across 8 languages
  • Assistant-style chat functionality
  • Achieves 52.11% average score across major benchmarks
  • Efficient deployment using vLLM backend
  • Enhanced throughput for production environments
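Deployment with the vLLM backend can look like the following sketch (the Hugging Face model ID is assumed from the neuralmagic organization; the context-length flag is an ordinary vLLM option, not something this model requires):

```shell
# Launch vLLM's OpenAI-compatible server with the FP8 checkpoint
vllm serve neuralmagic/Llama-3.2-1B-Instruct-FP8 --max-model-len 4096

# Query it with a standard chat-completions request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Llama-3.2-1B-Instruct-FP8",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```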

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its exceptional balance between efficiency and performance. The FP8 quantization enables significant resource savings while maintaining 99.8% of the original model's accuracy across major benchmarks like MMLU, ARC-Challenge, and GSM-8k.

Q: What are the recommended use cases?

The model is ideal for commercial and research applications requiring multilingual capabilities and assistant-like chat functionality. It's particularly suitable for deployment scenarios where resource efficiency is crucial while maintaining high performance standards.
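The assistant-style chat functionality relies on the standard Llama 3 instruct prompt template. In practice the tokenizer's `apply_chat_template` handles this, but as a sketch, here is how a list of messages is flattened into the prompt string the model actually sees:

```python
def build_llama3_prompt(messages):
    """Flatten chat messages into the Llama 3 instruct prompt format.

    Each message is a dict with "role" ("system"/"user"/"assistant")
    and "content". The trailing assistant header cues the model to reply.
    """
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        prompt += msg["content"] + "<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

prompt = build_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hola, ¿cómo estás?"},
])
```

The same template applies regardless of which of the 8 supported languages the user writes in.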
