Llama-3.2-3B-Instruct-FP8

neuralmagic

Optimized 3B parameter LLaMA-3 model with FP8 quantization, offering 50% memory reduction while maintaining 99.7% performance across benchmarks

  • Parameter Count: 3.61B
  • Model Type: Instruction-tuned Language Model
  • Architecture: LLaMA-3
  • License: LLaMA 3.2
  • Supported Languages: 8 (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)

What is Llama-3.2-3B-Instruct-FP8?

Llama-3.2-3B-Instruct-FP8 is an optimized version of Meta's Llama-3.2-3B-Instruct model with FP8 quantization applied to both weights and activations. This optimization reduces GPU memory requirements by approximately 50% while recovering, on average, 99.7% of the original model's scores across benchmarks.
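The 50% figure follows directly from halving the bytes stored per parameter. A back-of-the-envelope check, using the parameter count from the table above (weights only; KV cache and activation memory are excluded):

```python
# Rough GPU memory needed to hold the model weights alone.
params = 3.61e9  # parameter count from the model card

bf16_gb = params * 2 / 1e9  # 2 bytes per parameter at BF16/FP16
fp8_gb = params * 1 / 1e9   # 1 byte per parameter at FP8

print(f"BF16 weights: {bf16_gb:.2f} GB")            # 7.22 GB
print(f"FP8 weights:  {fp8_gb:.2f} GB")             # 3.61 GB
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")  # 50%
```

Actual memory use at serving time will be somewhat higher once the KV cache and runtime buffers are included, but the weight footprint itself halves.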

Implementation Details

The model quantizes the linear operators within transformer blocks, using symmetric static per-channel quantization for weights and symmetric per-tensor quantization for activations, implemented with the llm-compressor library.

  • Weight and activation precision reduced from 16 bits to 8 bits (FP8)
  • 50% reduction in GPU memory usage
  • 2x increase in matrix-multiply compute throughput
  • Calibrated using 512 sequences from Neural Magic's LLM compression dataset
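A minimal NumPy sketch of the two scaling schemes described above. This is a simplification for illustration, not the llm-compressor implementation: it computes the amax-based symmetric scales (per output channel for weights, per tensor for activations) against the FP8 E4M3 maximum of 448, and only clips into range rather than rounding to the FP8-representable grid:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def per_channel_scales(weight: np.ndarray) -> np.ndarray:
    """Symmetric static per-channel scales for a [out_features, in_features] weight."""
    amax = np.abs(weight).max(axis=1, keepdims=True)  # one amax per output channel
    return amax / FP8_E4M3_MAX


def per_tensor_scale(activation: np.ndarray) -> float:
    """Symmetric per-tensor scale: a single amax for the whole tensor."""
    return float(np.abs(activation).max() / FP8_E4M3_MAX)


def fake_quantize(x: np.ndarray, scale) -> np.ndarray:
    """Scale into FP8 range, clip, and scale back (FP8 grid rounding omitted)."""
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_dq = fake_quantize(w, per_channel_scales(w))
print(np.abs(w - w_dq).max())  # tiny: only scaling round-trip error here
```

In the real pipeline the scales for activations are "static": they are fixed once from the 512 calibration sequences rather than recomputed per batch at inference time.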

Core Capabilities

  • Multi-lingual support across 8 languages
  • Assistant-style chat functionality
  • Benchmark performance: 62.61% on MMLU (5-shot), 77.86% on GSM-8K
  • Efficient deployment through vLLM backend
  • Optimized for commercial and research applications

Frequently Asked Questions

Q: What makes this model unique?

The model's primary distinction lies in its efficient FP8 quantization, which significantly reduces resource requirements while maintaining near-original performance. This makes it particularly valuable for deployment scenarios where computational resources are constrained.

Q: What are the recommended use cases?

The model is well suited to commercial and research applications that require multi-lingual capabilities. It excels in assistant-style chat scenarios and can be deployed in production using the vLLM backend for optimal performance.
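A minimal serving sketch with vLLM's offline inference API, assuming the checkpoint is available on the Hugging Face Hub as `neuralmagic/Llama-3.2-3B-Instruct-FP8` and that a CUDA GPU is present (hardware with native FP8 support, such as Hopper-class GPUs, gets the compute-throughput benefit):

```python
from vllm import LLM, SamplingParams

# vLLM reads the FP8 quantization config from the checkpoint files,
# so no extra quantization flags are needed here.
llm = LLM(model="neuralmagic/Llama-3.2-3B-Instruct-FP8")

params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
    {"role": "user", "content": "Summarize FP8 quantization in one sentence."}
]

# Assistant-style chat: vLLM applies the model's chat template.
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served over an OpenAI-compatible HTTP API with `vllm serve neuralmagic/Llama-3.2-3B-Instruct-FP8`.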
