DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Quantization | INT8 (Weights & Activations) |
| Release Date | February 5, 2025 |
| Developer | Neural Magic |
| Model URL | https://huggingface.co/neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 |
What is DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8?
This is an optimized version of the DeepSeek-R1-Distill-Qwen-32B model that applies INT8 quantization to both weights and activations. Quantization reduces GPU memory requirements by approximately 50% while preserving 99.57% of the original model's accuracy, and it delivers up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment.
Implementation Details
The model uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations, applied with the GPTQ algorithm via the llm-compressor library. Only the linear operators within the transformer blocks are quantized, preserving the model's core behavior while reducing resource usage; a sketch of such a recipe appears after the list below.
- Reduces GPU memory requirements by ~50%
- Increases matrix-multiply compute throughput by ~2x
- Maintains near-original accuracy across benchmarks
- Compatible with vLLM backend for efficient deployment
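To make the scheme above concrete, here is a minimal sketch of a W8A8 GPTQ recipe expressed with llm-compressor. It is not the exact recipe used to create this checkpoint: import paths vary across llm-compressor versions, and the calibration dataset, sample count, and the choice to skip `lm_head` are assumptions rather than published settings.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # import path may differ by llm-compressor version
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # base model to quantize

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W8A8 scheme: INT8 symmetric per-channel weights and INT8 symmetric per-token
# activations, applied only to Linear layers inside the transformer blocks.
# Leaving lm_head unquantized is an assumption based on common recipes.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

# One-shot calibration pass; dataset and sample counts are illustrative defaults.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

save_dir = "DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```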
Core Capabilities
- Strong performance in reasoning tasks (94.98% on MATH-500)
- Excellent coding capabilities (85.80% pass@1 on HumanEval)
- Robust performance on OpenLLM benchmarks (74.50% average score on V1)
- Efficient RAG and summarization processing
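The scores above come from standard benchmark harnesses and can, in principle, be reproduced with lm-evaluation-harness running on a vLLM backend. The snippet below is only a hedged sketch: the harness version, task selection, few-shot settings, and hardware configuration behind the published numbers are not specified in this card, so all values shown are assumptions.

```python
import lm_eval

# Illustrative settings; adjust tensor_parallel_size and max_model_len to your GPUs.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8,"
        "dtype=auto,max_model_len=4096,tensor_parallel_size=2"
    ),
    tasks=["gsm8k"],  # swap in the OpenLLM V1 / MATH-500 / HumanEval setups as needed
    batch_size="auto",
)
print(results["results"])
```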
Frequently Asked Questions
Q: What makes this model unique?
The model combines high-performance quantization with minimal accuracy loss, making it particularly suitable for production deployment where resource efficiency is crucial. It achieves this while maintaining performance across a wide range of tasks, from reasoning to coding.
Q: What are the recommended use cases?
The model excels in scenarios requiring efficient deployment, including instruction following, multi-turn chat, code generation, and RAG applications. It's particularly valuable for deployments where memory constraints or inference speed are critical factors.
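Since the checkpoint is compatible with the vLLM backend, a minimal offline-inference sketch might look like the following. The tensor-parallel degree, context length, and sampling parameters are assumptions to adapt to your hardware, not recommendations from this card.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"

# Tensor-parallel size and context length are illustrative; size them to your GPUs.
llm = LLM(model=model_id, tensor_parallel_size=2, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain the difference between INT8 and FP16 inference."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=512))
print(outputs[0].outputs[0].text)
```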