Mistral-Nemo-Instruct-2407-FP8

neuralmagic

Optimized 12.2B-parameter Mistral model quantized to FP8, offering a 50% memory reduction while maintaining 99.53% of the original model's performance.

Property         Value
Parameter Count  12.2B
License          Apache 2.0
Tensor Type      BF16/F8_E4M3
OpenLLM Score    71.28

What is Mistral-Nemo-Instruct-2407-FP8?

Mistral-Nemo-Instruct-2407-FP8 is an optimized version of the original Mistral-Nemo-Instruct model, specifically designed for efficient deployment while maintaining high performance. Through FP8 quantization, it achieves approximately 50% reduction in disk size and GPU memory requirements compared to the original model, while preserving 99.53% of its performance.

Implementation Details

The model's key optimization is its quantization scheme: symmetric per-tensor quantization of both the weights and activations of linear operators within transformer blocks, using the FP8 data type. Quantization was performed with the AutoFP8 framework, calibrated on 512 sequences from the UltraChat dataset.
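To make the scheme concrete, here is a minimal sketch of symmetric per-tensor FP8 (E4M3) quantization in NumPy. It models the two essential steps — computing a single scale from the tensor's absolute maximum, then rounding each scaled value to a 4-bit significand — while omitting details a real kernel handles (exponent-range clamping, subnormals, saturation modes). All function names here are illustrative, not part of AutoFP8's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def round_significand(v: np.ndarray) -> np.ndarray:
    """Round to a 4-bit significand (1 implicit bit + 3 mantissa bits),
    approximating E4M3 rounding for normal values."""
    m, e = np.frexp(v)            # v = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0  # keep 4 bits of significand precision
    return np.ldexp(m, e)


def quantize_per_tensor(x: np.ndarray):
    """Symmetric per-tensor quantization: one scale for the whole tensor,
    chosen so the largest |value| maps exactly to the FP8 maximum."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = round_significand(np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale


np.random.seed(0)
w = np.random.randn(64, 64).astype(np.float32)  # stand-in for a weight matrix
q, s = quantize_per_tensor(w)
w_hat = dequantize(q, s)
# A 4-bit significand bounds the per-element relative error at ~6.25%,
# which is why FP8 weights track the original model so closely.
```

Because the scheme is symmetric (no zero-point) and per-tensor (one scale per weight matrix), dequantization is a single multiply, which keeps the runtime overhead in inference kernels minimal.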

  • Weight and activation quantization to FP8
  • Compatible with vLLM >= 0.5.0
  • 4096 token context window
  • Optimized for commercial and research applications
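Given the vLLM compatibility noted above, the checkpoint can be served through vLLM's OpenAI-compatible API server. A minimal deployment fragment (the Hugging Face model identifier and flag values are illustrative; serving requires a GPU with enough memory for the FP8 weights):

```shell
pip install "vllm>=0.5.0"

python -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8 \
  --max-model-len 4096
```

Once the server is up, any OpenAI-compatible client can send chat requests to it, so existing assistant-style applications need no code changes beyond pointing at the new endpoint.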

Core Capabilities

  • Achieves a 71.28 average score on the OpenLLM benchmark
  • Strong results across tasks: MMLU (68.50%), GSM-8K (73.01%), HellaSwag (84.18%)
  • Supports efficient deployment through the vLLM backend
  • Specialized for English-language tasks and assistant-style chat applications

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its efficient FP8 quantization that reduces resource requirements by 50% while maintaining over 99% of the original model's performance, making it particularly suitable for production deployment.

Q: What are the recommended use cases?

The model is optimized for English language applications, particularly in commercial and research contexts requiring assistant-like chat functionality. It's specifically designed for deployment scenarios where resource efficiency is crucial without compromising performance.
