DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic

Maintained by: neuralmagic


| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Developer | Neural Magic |
| Release Date | February 5, 2025 |
| Quantization | FP8 (Weights & Activations) |
| Model URL | Hugging Face |

What is DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic?

This is an optimized version of the DeepSeek-R1-Distill-Qwen-32B model that uses FP8 quantization for both weights and activations. The optimization reduces the model's disk size and GPU memory requirements by approximately 50% while maintaining 99.8% of the original model's accuracy. It's specifically designed for efficient deployment using the vLLM backend.

Implementation Details

The model quantizes weights with a symmetric per-channel scheme and activations with a symmetric per-token scheme whose scales are computed dynamically at inference time (hence "dynamic" in the model name). Only the linear operators within transformer blocks are quantized; other components, such as embeddings and normalization layers, are left unquantized to preserve accuracy.
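The scale computation behind these two schemes can be sketched in NumPy. This is a simplified illustration, not the actual quantization kernel: it uses the FP8 E4M3 maximum magnitude (448) to derive scales, and rounds to a uniform grid as a coarse stand-in for the non-uniform FP8 floating-point grid.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3


def quantize_per_channel(weights):
    """Symmetric per-channel weight quantization: one scale per output channel (row)."""
    # Map each channel's max |value| onto the FP8 representable range.
    scales = np.abs(weights).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)  # guard against all-zero channels
    # Uniform rounding here is only a stand-in for rounding to the FP8 grid.
    q = np.clip(np.round(weights / scales), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales


def quantize_per_token(activations):
    """Symmetric per-token dynamic quantization: one scale per token (row),
    computed on the fly from the activations themselves."""
    scales = np.abs(activations).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)
    q = np.clip(np.round(activations / scales), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales


# Dequantizing (q * scales) approximately recovers the original tensor.
W = np.random.default_rng(0).standard_normal((4, 8))
q, s = quantize_per_channel(W)
W_hat = q * s
```

Per-channel scales for weights (fixed ahead of time) and per-token scales for activations (recomputed each forward pass) keep the quantization error proportional to each channel's or token's own magnitude, which is why accuracy stays close to the original model.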

  • Achieves up to 1.5x speedup in single-stream deployment
  • Delivers up to 1.7x speedup in multi-stream asynchronous deployment
  • Maintains exceptional accuracy across various benchmarks including OpenLLM V1 and V2
  • Supports efficient deployment through vLLM with OpenAI-compatible serving
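As a deployment sketch, the model can be served with vLLM's OpenAI-compatible server. The repository id and flags below are assumptions based on the model card and standard vLLM CLI conventions; adjust them to your hardware.

```shell
# Launch an OpenAI-compatible server for the quantized model
# (tensor-parallel size is an example; set it to match your GPU count)
vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic \
  --tensor-parallel-size 2

# Query it with the standard OpenAI chat completions API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic",
       "messages": [{"role": "user", "content": "Explain FP8 quantization briefly."}]}'
```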

Core Capabilities

  • Strong performance in reasoning tasks (75.55 average score)
  • Excellent coding capabilities (85.20% pass@1 on HumanEval)
  • Robust performance across various OpenLLM benchmarks
  • Significant cost reduction in deployment scenarios
  • Enhanced throughput for both single and multi-stream operations

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimal balance between performance and efficiency, achieving significant speed improvements while maintaining accuracy through FP8 quantization. It's particularly notable for reducing deployment costs while preserving the core capabilities of the original model.

Q: What are the recommended use cases?

The model excels in various scenarios including instruction following, multi-turn chat, code completion, and large-scale text processing. It's particularly well-suited for production environments where deployment efficiency and cost optimization are crucial while maintaining high-quality outputs.
