DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
| Property | Value |
|---|---|
| Model Type | LlamaForCausalLM |
| Parameter Count | 70B |
| Quantization | FP8 Dynamic |
| Release Date | 2/1/2025 |
| Developer | Neural Magic |
| Model URL | huggingface.co/neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic |
What is DeepSeek-R1-Distill-Llama-70B-FP8-dynamic?
This is a quantized version of the DeepSeek-R1-Distill-Llama-70B model, optimized with FP8 dynamic quantization to reduce model size while maintaining accuracy. Cutting the weights from 16 bits to 8 bits per parameter reduces disk size and GPU memory requirements by roughly 50%.
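The 50% figure follows directly from the per-parameter storage. A quick back-of-the-envelope check in Python (weights only; KV cache and activation memory are extra and are not halved by this scheme):

```python
# Weight storage for a 70B-parameter model at 16-bit vs 8-bit precision.
params = 70e9

bf16_gb = params * 2 / 1e9  # 2 bytes per BF16 weight -> ~140 GB
fp8_gb = params * 1 / 1e9   # 1 byte per FP8 weight   -> ~70 GB

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB "
      f"({1 - fp8_gb / bf16_gb:.0%} reduction)")
```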
Implementation Details
Quantization is applied only to the linear operators within the transformer blocks: weights use a symmetric per-channel scheme, while activations use a symmetric per-token scheme whose scales are computed dynamically at inference time. The conversion is implemented with LLM Compressor.
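For concreteness, here is a minimal sketch of producing such a checkpoint with LLM Compressor, modeled on its data-free FP8_DYNAMIC example. The source model ID and save path are illustrative, and this is not necessarily Neural Magic's exact production script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # upstream source model

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static symmetric per-channel FP8 for weights, dynamic
# symmetric per-token FP8 for activations; lm_head stays in high precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Data-free: per-token activation scales are computed at inference time,
# so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

SAVE_DIR = "DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"  # illustrative path
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

Because activation scales are derived per token at inference time, no calibration data is needed; newer llm-compressor releases also expose `oneshot` at the package top level, so the import path may differ by version.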
- Achieves up to 1.4x speedup in single-stream deployment
- Up to 3.0x speedup in multi-stream asynchronous deployment
- Recovers 99.8% of the original model's accuracy on average across OpenLLM V1 benchmarks
- Deploys efficiently on the vLLM backend, as sketched below
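A minimal vLLM serving sketch; the `tensor_parallel_size` value and sampling settings are assumptions, not prescribed by the model card:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 is an assumption: even at FP8 the 70B weights
# occupy ~70 GB, so a single smaller GPU will not hold them plus KV cache.
llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(
    ["Explain FP8 dynamic quantization in two sentences."], params
)
print(outputs[0].outputs[0].text)
```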
Core Capabilities
- Strong performance in reasoning tasks (76.49% average score)
- Excellent coding capabilities (81% pass@1 on HumanEval)
- Robust performance across reasoning and math benchmarks, including MATH-500 (95.14%) and GSM8K (93.03%); see the evaluation sketch after this list
- Efficient scaling across different GPU configurations (A6000, A100, H100)
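Scores like these are typically reproduced with the lm-evaluation-harness. Below is a hedged sketch of its Python API; the task, few-shot count, and parallelism are illustrative and do not reproduce the exact published evaluation recipe:

```python
import lm_eval

# Illustrative settings only; the published scores follow the model
# card's exact prompts and few-shot configuration.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic,"
        "tensor_parallel_size=2"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```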
Frequently Asked Questions
Q: What makes this model unique?
The model pairs FP8 dynamic quantization, which roughly halves disk and GPU memory requirements, with accuracy recovery above 99% of the original model on most benchmarks, so the efficiency gain comes at very little quality cost.
Q: What are the recommended use cases?
The model excels at instruction following, multi-turn chat, code completion, and large-scale text processing; a multi-turn chat sketch follows. It is particularly effective in deployments where GPU memory and serving cost must be reduced without sacrificing output quality.
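As a concrete example of the multi-turn chat use case, the following sketch talks to a vLLM OpenAI-compatible server assumed to be already running at localhost:8000 serving this model; the endpoint and server setup are assumptions, not part of the model card:

```python
from openai import OpenAI

MODEL = "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"

# Assumes a vLLM OpenAI-compatible server is already serving this model
# locally; the base_url and api_key are placeholders for that setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Summarize FP8 dynamic quantization."}]
first = client.chat.completions.create(model=MODEL, messages=messages)
messages.append(
    {"role": "assistant", "content": first.choices[0].message.content}
)

# Multi-turn chat is just resending the accumulated history each turn.
messages.append({"role": "user", "content": "Now give a one-line TL;DR."})
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```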