DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Developer | Neural Magic |
| Release Date | February 5, 2025 |
| Quantization | FP8 (Weights & Activations) |
| Model URL | Hugging Face Repository |
What is DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic?
This is an optimized version of the DeepSeek-R1-Distill-Qwen-14B model that applies FP8 quantization to both weights and activations. The quantization roughly halves disk size and GPU memory requirements while maintaining performance comparable to the parent model.
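The roughly 50% figure follows directly from halving the weight precision, as a quick back-of-envelope check shows (the parameter count used below is approximate):

```python
# Back-of-envelope: halving weight precision halves weight storage.
params = 14.8e9              # approximate parameter count of the 14B model

bf16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes/param -> ~30 GB
fp8_gb = params * 1 / 1e9    # FP8 weights: 1 byte/param -> ~15 GB

print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of the original weight footprint)")
```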
Implementation Details
The model uses symmetric quantization schemes: per-channel for weights and dynamic per-token for activations, which is the "dynamic" in the model name. Only the linear operators within transformer blocks are quantized, preserving model accuracy while significantly improving efficiency.
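A minimal sketch of how such a checkpoint can be produced with the llm-compressor library is shown below. The `FP8_DYNAMIC` scheme and the `lm_head` exclusion follow the recipe commonly used for Neural Magic's FP8-dynamic releases; treat them as assumptions rather than details confirmed by this page, and note that `oneshot` import paths vary across llm-compressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Assumed source checkpoint for the distilled 14B model.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC = static per-channel FP8 weights plus dynamic per-token
# FP8 activations. Quantize the Linear layers inside transformer blocks
# and keep lm_head at original precision (assumed, per common practice).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Dynamic activation scales are computed at runtime, so no calibration
# dataset is needed for this one-shot pass.
oneshot(model=model, recipe=recipe)

save_dir = "DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```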
- Weight quantization reduces bits per parameter from 16 to 8
- Achieves up to 1.4x speedup in both single-stream and multi-stream deployment
- Compatible with the vLLM backend for efficient deployment (see the sketch after this list)
- Recovers 99.8% of the original model's average score on the OpenLLM V1 benchmark suite
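As an illustration of the vLLM path, the snippet below loads the quantized checkpoint with vLLM's offline inference API. The repository id is assumed from the model name; verify it against the Hugging Face repository linked above.

```python
from vllm import LLM, SamplingParams

# Repository id assumed from the model name; confirm on Hugging Face.
llm = LLM(model="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic")

# Sampling settings are illustrative, not tuned recommendations.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompt = "Explain, step by step, why the sum of two odd integers is even."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

In recent vLLM versions, the same checkpoint can also be exposed as an OpenAI-compatible endpoint with `vllm serve <repo-id>`.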
Core Capabilities
- Strong performance in reasoning tasks (74.29% average score)
- Excellent coding capabilities (77.20% pass@1 on HumanEval)
- Efficient handling of contexts up to 4,096 tokens
- Optimized for both single-stream and multi-stream inference
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its efficient FP8 quantization, which cuts resource requirements by roughly 50% while retaining over 99% of the original model's performance across most benchmarks. Notably, it even outperforms the original model on some reasoning tasks.
Q: What are the recommended use cases?
The model excels in instruction following, code generation, and reasoning tasks. It's particularly well-suited for deployment scenarios where resource efficiency is crucial, showing strong performance in both single-stream and multi-stream applications.