DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

Model Type: LlamaForCausalLM
Parameter Count: 70B
Quantization: FP8 Dynamic
Release Date: 2/1/2025
Developer: Neural Magic
Model URL: huggingface.co/neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

What is DeepSeek-R1-Distill-Llama-70B-FP8-dynamic?

This is a quantized version of the DeepSeek-R1-Distill-Llama-70B model, optimized with FP8 dynamic quantization to reduce model size while maintaining performance. By cutting the weight representation from 16 bits to 8 bits per parameter, it reduces disk size and GPU memory requirements by approximately 50%.
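
As a back-of-the-envelope check of that 50% figure, halving the bytes per parameter halves the weight footprint. A quick sketch (weights only; KV cache and activation memory are not included):

```python
# Rough weight-memory estimate for a 70B-parameter model.
params = 70e9
GB = 1e9

bf16_gb = params * 2 / GB  # 16-bit (2-byte) weights: ~140 GB
fp8_gb = params * 1 / GB   # 8-bit (1-byte) weights:  ~70 GB

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```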

Implementation Details

Quantization targets only the linear operators within the transformer blocks. Weights use symmetric per-channel FP8 quantization, while activations use symmetric per-token dynamic quantization (the "dynamic" in the model name). The conversion is implemented with LLM Compressor, as sketched below.
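
A minimal sketch of the typical LLM Compressor workflow for producing an FP8-dynamic checkpoint. This follows the library's standard oneshot + QuantizationModifier pattern; the exact recipe Neural Magic used may differ, and the import path for `oneshot` varies across llmcompressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC = static symmetric per-channel scales for weights plus
# dynamic symmetric per-token scales for activations (computed at
# runtime), so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",      # linear operators inside transformer blocks
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],    # keep the output projection in higher precision
)

oneshot(model=model, recipe=recipe)

save_dir = "DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```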

  • Achieves up to 1.4x speedup in single-stream deployment
  • Up to 3.0x speedup in multi-stream asynchronous deployment
  • Maintains 99.8% accuracy compared to the original model on OpenLLM V1 benchmarks
  • Deploys efficiently with the vLLM backend (see the serving sketch after this list)
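
Serving follows the standard vLLM pattern. A hedged sketch (the `tensor_parallel_size` and sampling settings below are illustrative choices, not values from the model card):

```python
from vllm import LLM, SamplingParams

# The FP8 checkpoint loads directly: vLLM picks up the quantization
# config from the repository. tensor_parallel_size depends on your GPUs.
llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain, step by step, why 17 * 24 = 408."], params)
print(outputs[0].outputs[0].text)
```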

Core Capabilities

  • Strong performance in reasoning tasks (76.49% average score)
  • Excellent coding capabilities (81% pass@1 on HumanEval)
  • Robust performance across various benchmarks including MATH-500 (95.14%) and GSM8K (93.03%)
  • Efficient scaling across different GPU configurations (A6000, A100, H100)

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its exceptional balance between efficiency and performance, using FP8 dynamic quantization to significantly reduce resource requirements while maintaining over 99% of the original model's accuracy across most benchmarks.

Q: What are the recommended use cases?

The model excels in a range of scenarios, including instruction following, multi-turn chat, code completion, and large-scale text processing. It is particularly well suited to deployments where GPU memory and serving cost must be reduced without sacrificing accuracy.
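
For multi-turn chat, prompts should be rendered with the repository's chat template. A sketch using transformers (this assumes the tokenizer ships a chat template, as the DeepSeek-R1 distills do; the example message is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
)
messages = [
    {"role": "user", "content": "Reverse a singly linked list in Python."},
]
# Render the conversation into the model's expected prompt format,
# then pass the resulting string to the vLLM generate() call shown earlier.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```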
