DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Developer | Neural Magic |
| Release Date | February 5, 2025 |
| Quantization | FP8 (Weights & Activations) |
| Model URL | Hugging Face Repository |
What is DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic?
This is an optimized version of the DeepSeek-R1-Distill-Qwen-14B model that applies FP8 quantization to both weights and activations. The quantization roughly halves disk size and GPU memory requirements while maintaining performance comparable to the parent model.
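The roughly 50% figure follows directly from halving the weight precision, as a quick back-of-envelope check shows (the parameter count used below is approximate):

```python
# Back-of-envelope: halving weight precision halves weight storage.
params = 14.8e9              # approximate parameter count of the 14B model

bf16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes/param -> ~30 GB
fp8_gb = params * 1 / 1e9    # FP8 weights: 1 byte/param -> ~15 GB

print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of the original weight footprint)")
```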
Implementation Details
The model uses symmetric quantization schemes: per-channel for weights and dynamic per-token for activations, which is the "dynamic" in the model name. Only the linear operators within transformer blocks are quantized, preserving model accuracy while significantly improving efficiency.
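A minimal sketch of how such a checkpoint can be produced with the llm-compressor library is shown below. The `FP8_DYNAMIC` scheme and the `lm_head` exclusion follow the recipe commonly used for Neural Magic's FP8-dynamic releases; treat them as assumptions rather than details confirmed by this page, and note that `oneshot` import paths vary across llm-compressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Assumed source checkpoint for the distilled 14B model.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC = static per-channel FP8 weights plus dynamic per-token
# FP8 activations. Quantize the Linear layers inside transformer blocks
# and keep lm_head at original precision (assumed, per common practice).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Dynamic activation scales are computed at runtime, so no calibration
# dataset is needed for this one-shot pass.
oneshot(model=model, recipe=recipe)

save_dir = "DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```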
- Weight quantization reduces bits per parameter from 16 to 8
- Achieves up to 1.4x speedup in both single-stream and multi-stream deployment
- Compatible with the vLLM backend for efficient deployment (see the sketch after this list)
- Recovers 99.8% of the original model's average score on the OpenLLM V1 benchmark suite
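As an illustration of the vLLM path, the snippet below loads the quantized checkpoint with vLLM's offline inference API. The repository id is assumed from the model name; verify it against the Hugging Face repository linked above.

```python
from vllm import LLM, SamplingParams

# Repository id assumed from the model name; confirm on Hugging Face.
llm = LLM(model="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic")

# Sampling settings are illustrative, not tuned recommendations.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompt = "Explain, step by step, why the sum of two odd integers is even."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

In recent vLLM versions, the same checkpoint can also be exposed as an OpenAI-compatible endpoint with `vllm serve <repo-id>`.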
Core Capabilities
- Strong performance in reasoning tasks (74.29% average score)
- Excellent coding capabilities (77.20% pass@1 on HumanEval)
- Efficient handling of contexts up to 4,096 tokens
- Optimized for both single-stream and multi-stream inference
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its efficient FP8 quantization, which cuts resource requirements by roughly 50% while retaining over 99% of the original model's performance across most benchmarks. Notably, it even outperforms the original model on some reasoning tasks.
Q: What are the recommended use cases?
The model excels in instruction following, code generation, and reasoning tasks. It's particularly well-suited for deployment scenarios where resource efficiency is crucial, showing strong performance in both single-stream and multi-stream applications.