# DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8
| Property | Value |
|---|---|
| Model Type | Quantized Language Model |
| Architecture | Qwen2ForCausalLM |
| Quantization | INT8 (Weights & Activations) |
| Developer | Neural Magic |
| Release Date | 2/5/2025 |
| Model URL | Hugging Face |
## What is DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8?
This is an optimized version of the DeepSeek-R1-Distill-Qwen-7B model that uses INT8 quantization for both weights and activations. The quantization reduces memory requirements by approximately 50% and increases computation throughput by up to 2x while maintaining model accuracy. The model achieves up to 1.6x speedup in both single-stream and multi-stream asynchronous deployment scenarios.
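The memory claim follows directly from parameter-count arithmetic. A quick sketch (ignoring the KV cache and runtime overhead, which add to both configurations):

```python
# Back-of-the-envelope weight-memory estimate for a 7B-parameter model.
params = 7e9

bf16_bytes = params * 2  # BF16/FP16: 2 bytes per parameter
int8_bytes = params * 1  # INT8:      1 byte per parameter

print(f"BF16 weights: ~{bf16_bytes / 1e9:.0f} GB")      # ~14 GB
print(f"INT8 weights: ~{int8_bytes / 1e9:.0f} GB")      # ~7 GB
print(f"Reduction: {1 - int8_bytes / bf16_bytes:.0%}")  # 50%
```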
## Implementation Details
The model uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations, applied with the GPTQ algorithm as implemented in the llm-compressor library. Only the linear operators within transformer blocks are quantized; all other components remain in their original precision.
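A minimal llm-compressor sketch of such a recipe. The exact recipe Neural Magic used is not reproduced here, so the calibration dataset, sample counts, and ignore list below are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # import path varies by llm-compressor version
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W8A8 scheme: INT8 symmetric per-channel weights and INT8 symmetric
# per-token activations, applied only to Linear layers. Keeping
# lm_head unquantized is an assumption, not confirmed by the card.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # hypothetical calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration settings
    num_calibration_samples=512,
)

model.save_pretrained("DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8")
```

Neural Magic reports the following characteristics for the quantized model: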
- 50% reduction in GPU memory usage
- 2x increase in matrix multiplication compute throughput
- 50% reduction in disk storage requirements
- Recovers 100.74% of the original model's average score on reasoning benchmarks
- Compatible with the vLLM backend for efficient deployment (see the sketch below)
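Offline inference with vLLM is straightforward once the checkpoint is downloaded. A sketch, assuming the model is published under the `neuralmagic` organization on Hugging Face (adjust the ID to the actual repository):

```python
from vllm import LLM, SamplingParams

# Hypothetical Hugging Face model ID; verify against the actual repo.
model_id = "neuralmagic/DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8"

llm = LLM(model=model_id)
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(
    ["Prove that the sum of two even integers is even."], params
)
print(outputs[0].outputs[0].text)
```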
## Core Capabilities
- Strong performance in reasoning tasks (66.28 average score)
- Excellent mathematical ability (93% on MATH-500)
- Robust coding capabilities (39.50% pass@1 on HumanEval)
- Efficient RAG processing with reduced latency
- Optimized for both single-stream and multi-stream inference (a serving sketch follows this list)
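Multi-stream deployment typically means serving many concurrent requests behind vLLM's OpenAI-compatible server. A sketch, reusing the hypothetical model ID from above:

```python
# Start the server first (shell):
#   vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8",
    messages=[{"role": "user",
               "content": "Summarize INT8 W8A8 quantization in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

vLLM batches concurrent requests with continuous batching, which is where the multi-stream speedup is realized.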
## Frequently Asked Questions
### Q: What makes this model unique?
This model stands out for delivering substantial efficiency gains through quantization while matching, and on some reasoning benchmarks slightly exceeding, the original model's accuracy. That balance between efficiency and accuracy makes it well suited to production deployments.
### Q: What are the recommended use cases?
The model excels in scenarios requiring efficient inference, including instruction following, multi-turn chat, code generation, and RAG applications. It's particularly well-suited for deployment on resource-constrained systems or when optimizing for cost-efficiency in cloud environments.
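For multi-turn chat, the same OpenAI-compatible endpoint accepts the full conversation history on each call. A brief sketch against the server started above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Prior turns are replayed in full on each request; the server keeps no state.
history = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
    {"role": "assistant", "content": "def reverse(s: str) -> str:\n    return s[::-1]"},
    {"role": "user", "content": "Now add a docstring and a simple test."},
]

resp = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8",
    messages=history,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```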