DeepSeek-R1-FP4

nvidia

NVIDIA's quantized version of DeepSeek R1, optimized for efficient inference with FP4 precision and 128K context length, running on TensorRT-LLM.

Property	Value
License	MIT
Architecture	Transformer-based DeepSeek R1
Quantization	FP4
Context Length	128K tokens
Hardware Support	NVIDIA Blackwell
Model URL	https://huggingface.co/nvidia/DeepSeek-R1-FP4

What is DeepSeek-R1-FP4?

DeepSeek-R1-FP4 is NVIDIA's quantized version of the DeepSeek R1 auto-regressive language model, optimized for efficient inference using FP4 precision. Quantization reduces the bits per quantized parameter from 8 to 4, which cuts disk size and GPU memory requirements by approximately 1.6x. The saving is likely below the nominal 2x because only the linear operators inside the transformer blocks are quantized; the rest of the model keeps its original precision.
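The arithmetic behind the 1.6x figure can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: the 671B total parameter count is an assumption that does not come from this page, and the helper names are illustrative.

```python
def effective_bits(baseline_bits: float, reduction: float) -> float:
    """Average bits per parameter implied by an overall size reduction."""
    return baseline_bits / reduction


def checkpoint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate checkpoint size in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 671e9  # ASSUMPTION: total parameter count, not stated on this page

avg_bits = effective_bits(8, 1.6)          # average bits left per parameter
fp8_gb = checkpoint_gb(PARAMS, 8)          # 8-bit baseline
fp4_gb = checkpoint_gb(PARAMS, avg_bits)   # after the stated 1.6x reduction

print(f"average bits per parameter: {avg_bits:.2f}")
print(f"baseline: {fp8_gb:.0f} GB, quantized: {fp4_gb:.0f} GB")
```

A 1.6x reduction from 8 bits corresponds to an average of about 5 bits per parameter, which is consistent with a mix of 4-bit tensors, unquantized tensors, and scaling metadata.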

Implementation Details

The model is deployed with TensorRT-LLM, and 8x B200 GPUs are specified for optimal performance. Quantization is applied only to the weights and activations of the linear operators within the transformer blocks, which is where most of the parameters and compute reside.

Optimized using nvidia-modelopt v0.23.0
Supports up to 128K context length
Calibrated using cnn_dailymail dataset
Evaluated on MMLU benchmark
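To make the idea of FP4 weight quantization concrete, here is a minimal NumPy sketch of block-scaled rounding to the 4-bit E2M1 value grid. It is an illustration only, not the nvidia-modelopt implementation: the block size of 16, the float scales, and the function names are all assumptions, and activation quantization and calibration are left out.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # ASSUMPTION: elements sharing one scale; weights.size must be a multiple


def quantize_fp4(weights: np.ndarray, block: int = BLOCK):
    """Return (codes, scales): FP4 grid values and one float scale per block."""
    w = weights.reshape(-1, block)
    # Scale each block so its largest magnitude maps to the top of the grid (6.0).
    amax = np.abs(w).max(axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = w / scales
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]
    return codes, scales


def dequantize_fp4(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Reconstruct approximate weights from grid values and per-block scales."""
    return (codes * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 32))
codes, scales = quantize_fp4(w)
w_hat = dequantize_fp4(codes, scales, w.shape)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Per-block scaling keeps the coarse 8-level grid aligned with the local range of the weights, so a single outlier only degrades the precision of its own block.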

Core Capabilities

Efficient text generation with reduced memory footprint
High-performance inference using TensorRT-LLM
Support for long context understanding
Available for commercial and non-commercial use under the MIT license
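Inference through TensorRT-LLM might look like the sketch below, which uses the library's Python LLM API. Treat it as a starting point under stated assumptions: the import path and argument names should be checked against the installed TensorRT-LLM version, the helper names `llm_kwargs` and `generate` are hypothetical, and calling `generate` needs TensorRT-LLM installed plus the 8 GPUs described above.

```python
MODEL_ID = "nvidia/DeepSeek-R1-FP4"  # repository name from the Model URL above


def llm_kwargs(num_gpus: int = 8) -> dict:
    """Constructor arguments for tensorrt_llm.LLM, one tensor-parallel rank per GPU."""
    return {"model": MODEL_ID, "tensor_parallel_size": num_gpus}


def generate(prompts, max_tokens: int = 64):
    """Generate one completion per prompt (requires TensorRT-LLM and GPUs)."""
    # Imported lazily so this module still loads on machines without TensorRT-LLM.
    # ASSUMPTION: this import path matches the installed TensorRT-LLM version.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(**llm_kwargs())
    params = SamplingParams(max_tokens=max_tokens)
    return [out.outputs[0].text for out in llm.generate(prompts, params)]


# Example call, to be run only on a suitable multi-GPU host:
# print(generate(["Explain FP4 quantization in one sentence."]))
```

Keeping the constructor arguments in a separate function makes the GPU count easy to change and lets the configuration be inspected without loading the model.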

Frequently Asked Questions

Q: What makes this model unique?

FP4 quantization cuts disk size and GPU memory requirements by approximately 1.6x relative to the 8-bit baseline, which makes the model more practical for production deployments on NVIDIA Blackwell hardware.

Q: What are the recommended use cases?

The model suits text generation tasks where inference cost matters, particularly production deployments that need long-context generation (up to 128K tokens) within a limited GPU memory budget.