DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8

Maintained By
neuralmagic

Model Type: Qwen2ForCausalLM
Quantization: INT8 (Weights & Activations)
Release Date: February 5, 2025
Developer: Neural Magic
Model URL: https://huggingface.co/neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8

What is DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8?

This is an optimized version of the DeepSeek-R1-Distill-Qwen-32B model that applies INT8 quantization to both weights and activations. Quantization reduces memory requirements by approximately 50% while maintaining 99.57% of the original model's accuracy. The model achieves up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment.

Implementation Details

The model uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations, applied with the GPTQ algorithm via the llm-compressor library. Only the linear operators within transformer blocks are quantized; components such as the lm_head are left at their original precision, preserving the model's core functionality while reducing resource usage.
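As a rough illustration, the quantization step might look like the following llm-compressor sketch. The calibration dataset, sample count, and sequence length here are assumptions for illustration, not the exact settings used to produce this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W8A8 scheme: symmetric per-channel INT8 weights, symmetric per-token INT8
# activations. Only Linear layers inside transformer blocks are targeted;
# the lm_head is excluded and stays at its original precision.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

# One-shot GPTQ calibration. Dataset and sample count are assumptions.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = "DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```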

  • Reduces GPU memory requirements by ~50%
  • Increases matrix-multiply compute throughput by ~2x
  • Maintains near-original accuracy across benchmarks
  • Compatible with the vLLM backend for efficient deployment (see the sketch after this list)
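A minimal vLLM inference sketch follows. The sampling settings and GPU sizing comments are assumptions; with INT8 weights, the ~32B-parameter model occupies roughly half the memory of its BF16 counterpart.

```python
from vllm import LLM, SamplingParams

# INT8 weights roughly halve the footprint versus BF16, so the model fits
# on fewer or smaller GPUs; adjust tensor_parallel_size for your hardware.
llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",
    tensor_parallel_size=1,
    max_model_len=4096,
)

# Sampling settings are illustrative, not recommended defaults.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
prompts = ["Prove that the sum of two even integers is even."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```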

Core Capabilities

  • Strong performance in reasoning tasks (94.98% on MATH-500)
  • Excellent coding capabilities (85.80% pass@1 on HumanEval)
  • Robust performance on OpenLLM benchmarks (74.50% average score on V1)
  • Efficient retrieval-augmented generation (RAG) and summarization

Frequently Asked Questions

Q: What makes this model unique?

The model combines a roughly 50% reduction in memory footprint with minimal accuracy loss, making it particularly suitable for production deployments where resource efficiency is crucial. It achieves this while maintaining performance across a wide range of tasks, from reasoning to coding.

Q: What are the recommended use cases?

The model excels in scenarios requiring efficient deployment, including instruction following, multi-turn chat, code generation, and RAG applications. It's particularly valuable for deployments where memory constraints or inference speed are critical factors.
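For multi-turn chat or RAG backends, one common option is vLLM's OpenAI-compatible server. The port and client snippet below are illustrative assumptions, not settings from this model card.

```python
# Start the server first (shell):
#   vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 --port 8000
#
# Then query it with any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8",
    messages=[
        {"role": "user",
         "content": "Summarize the key trade-offs of INT8 quantization."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```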
