DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Quantization | INT8 (Weights & Activations) |
| Release Date | February 5, 2025 |
| Developer | Neural Magic |
| Model URL | https://huggingface.co/neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 |
What is DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8?
This is an optimized version of the DeepSeek-R1-Distill-Qwen-32B model that applies INT8 quantization to both weights and activations. Quantization reduces GPU memory requirements by approximately 50% while preserving 99.57% of the original model's accuracy, and it delivers up to 1.8x speedup in single-stream deployment and up to 2.2x speedup in multi-stream asynchronous deployment.
Implementation Details
The model uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations, applied with the GPTQ algorithm via the llm-compressor library. Only the linear operators within the transformer blocks are quantized, preserving the model's core behavior while reducing resource usage; a sketch of such a recipe appears after the list below.
- Reduces GPU memory requirements by ~50%
- Increases matrix-multiply compute throughput by ~2x
- Maintains near-original accuracy across benchmarks
- Compatible with vLLM backend for efficient deployment
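To make the scheme above concrete, here is a minimal sketch of a W8A8 GPTQ recipe expressed with llm-compressor. It is not the exact recipe used to create this checkpoint: import paths vary across llm-compressor versions, and the calibration dataset, sample count, and the choice to skip `lm_head` are assumptions rather than published settings.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # import path may differ by llm-compressor version
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # base model to quantize

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W8A8 scheme: INT8 symmetric per-channel weights and INT8 symmetric per-token
# activations, applied only to Linear layers inside the transformer blocks.
# Leaving lm_head unquantized is an assumption based on common recipes.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

# One-shot calibration pass; dataset and sample counts are illustrative defaults.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

save_dir = "DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```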
Core Capabilities
- Strong performance in reasoning tasks (94.98% on MATH-500)
- Excellent coding capabilities (85.80% pass@1 on HumanEval)
- Robust performance on OpenLLM benchmarks (74.50% average score on V1)
- Efficient RAG and summarization processing
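The scores above come from standard benchmark harnesses and can, in principle, be reproduced with lm-evaluation-harness running on a vLLM backend. The snippet below is only a hedged sketch: the harness version, task selection, few-shot settings, and hardware configuration behind the published numbers are not specified in this card, so all values shown are assumptions.

```python
import lm_eval

# Illustrative settings; adjust tensor_parallel_size and max_model_len to your GPUs.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8,"
        "dtype=auto,max_model_len=4096,tensor_parallel_size=2"
    ),
    tasks=["gsm8k"],  # swap in the OpenLLM V1 / MATH-500 / HumanEval setups as needed
    batch_size="auto",
)
print(results["results"])
```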
Frequently Asked Questions
Q: What makes this model unique?
The model combines high-performance quantization with minimal accuracy loss, making it particularly suitable for production deployment where resource efficiency is crucial. It achieves this while maintaining performance across a wide range of tasks, from reasoning to coding.
Q: What are the recommended use cases?
The model excels in scenarios requiring efficient deployment, including instruction following, multi-turn chat, code generation, and RAG applications. It's particularly valuable for deployments where memory constraints or inference speed are critical factors.
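Since the checkpoint is compatible with the vLLM backend, a minimal offline-inference sketch might look like the following. The tensor-parallel degree, context length, and sampling parameters are assumptions to adapt to your hardware, not recommendations from this card.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8"

# Tensor-parallel size and context length are illustrative; size them to your GPUs.
llm = LLM(model=model_id, tensor_parallel_size=2, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain the difference between INT8 and FP16 inference."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=512))
print(outputs[0].outputs[0].text)
```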