Llama-3.2-3B-Instruct-FP8

neuralmagic

Optimized 3B parameter LLaMA-3 model with FP8 quantization, offering 50% memory reduction while maintaining 99.7% performance across benchmarks

  • Parameter Count: 3.61B
  • Model Type: Instruction-tuned Language Model
  • Architecture: LLaMA-3
  • License: LLaMA 3.2
  • Supported Languages: 8 (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)

What is Llama-3.2-3B-Instruct-FP8?

Llama-3.2-3B-Instruct-FP8 is an optimized version of Meta's Llama-3.2-3B-Instruct model with FP8 quantization applied to both weights and activations. This optimization reduces GPU memory requirements by approximately 50% while recovering, on average, 99.7% of the original model's scores across benchmarks.
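The 50% figure follows directly from halving the bytes stored per parameter. A back-of-the-envelope check, using the parameter count from the table above (weights only; KV cache and activation memory are excluded):

```python
# Rough GPU memory needed to hold the model weights alone.
params = 3.61e9  # parameter count from the model card

bf16_gb = params * 2 / 1e9  # 2 bytes per parameter at BF16/FP16
fp8_gb = params * 1 / 1e9   # 1 byte per parameter at FP8

print(f"BF16 weights: {bf16_gb:.2f} GB")            # 7.22 GB
print(f"FP8 weights:  {fp8_gb:.2f} GB")             # 3.61 GB
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")  # 50%
```

Actual memory use at serving time will be somewhat higher once the KV cache and runtime buffers are included, but the weight footprint itself halves.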

Implementation Details

The model quantizes the linear operators within transformer blocks, using symmetric static per-channel quantization for weights and symmetric per-tensor quantization for activations, implemented with the llm-compressor library.

  • Weight and activation precision reduced from 16 bits to 8 bits (FP8)
  • 50% reduction in GPU memory usage
  • 2x increase in matrix-multiply compute throughput
  • Calibrated using 512 sequences from Neural Magic's LLM compression dataset
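A minimal NumPy sketch of the two scaling schemes described above. This is a simplification for illustration, not the llm-compressor implementation: it computes the amax-based symmetric scales (per output channel for weights, per tensor for activations) against the FP8 E4M3 maximum of 448, and only clips into range rather than rounding to the FP8-representable grid:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def per_channel_scales(weight: np.ndarray) -> np.ndarray:
    """Symmetric static per-channel scales for a [out_features, in_features] weight."""
    amax = np.abs(weight).max(axis=1, keepdims=True)  # one amax per output channel
    return amax / FP8_E4M3_MAX


def per_tensor_scale(activation: np.ndarray) -> float:
    """Symmetric per-tensor scale: a single amax for the whole tensor."""
    return float(np.abs(activation).max() / FP8_E4M3_MAX)


def fake_quantize(x: np.ndarray, scale) -> np.ndarray:
    """Scale into FP8 range, clip, and scale back (FP8 grid rounding omitted)."""
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_dq = fake_quantize(w, per_channel_scales(w))
print(np.abs(w - w_dq).max())  # tiny: only scaling round-trip error here
```

In the real pipeline the scales for activations are "static": they are fixed once from the 512 calibration sequences rather than recomputed per batch at inference time.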

Core Capabilities

  • Multi-lingual support across 8 languages
  • Assistant-style chat functionality
  • Benchmark performance: 62.61% on MMLU (5-shot), 77.86% on GSM-8K
  • Efficient deployment through vLLM backend
  • Optimized for commercial and research applications

Frequently Asked Questions

Q: What makes this model unique?

The model's primary distinction lies in its efficient FP8 quantization, which significantly reduces resource requirements while maintaining near-original performance. This makes it particularly valuable for deployment scenarios where computational resources are constrained.

Q: What are the recommended use cases?

The model is well suited to commercial and research applications that require multi-lingual capabilities. It excels in assistant-style chat scenarios and can be deployed in production using the vLLM backend for optimal performance.
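A minimal serving sketch with vLLM's offline inference API, assuming the checkpoint is available on the Hugging Face Hub as `neuralmagic/Llama-3.2-3B-Instruct-FP8` and that a CUDA GPU is present (hardware with native FP8 support, such as Hopper-class GPUs, gets the compute-throughput benefit):

```python
from vllm import LLM, SamplingParams

# vLLM reads the FP8 quantization config from the checkpoint files,
# so no extra quantization flags are needed here.
llm = LLM(model="neuralmagic/Llama-3.2-3B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
    {"role": "user", "content": "Summarize FP8 quantization in one sentence."}
]

# Assistant-style chat: vLLM applies the model's chat template.
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served over an OpenAI-compatible HTTP API with `vllm serve neuralmagic/Llama-3.2-3B-Instruct-FP8`.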
