DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8


An INT8-quantized version of DeepSeek-R1-Distill-Qwen-32B offering up to ~2x faster inference with 99.57% accuracy retention and ~50% lower memory use

| Property | Value |
|---|---|
| Model Type | Qwen2ForCausalLM |
| Quantization | INT8 (Weights & Activations) |
| Release Date | February 5, 2025 |
| Developer | Neural Magic |
| Model URL | https://huggingface.co/neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 |

What is DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8?

This is an optimized version of the DeepSeek-R1-Distill-Qwen-32B model that implements INT8 quantization for both weights and activations. The quantization reduces memory requirements by approximately 50% while maintaining 99.57% of the original model's accuracy. The model achieves up to 1.8x speedup in single-stream deployment and 2.2x speedup in multi-stream asynchronous deployment.
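The ~50% memory figure can be sanity-checked with back-of-the-envelope arithmetic for weight storage alone (a sketch; KV cache, activations, and runtime overhead are ignored):

```python
# Weight-only memory estimate for a 32B-parameter model.
PARAMS = 32e9

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per FP16/BF16 weight
int8_gb = PARAMS * 1 / 1e9  # 1 byte per INT8 weight

print(f"FP16: {fp16_gb:.0f} GB, INT8: {int8_gb:.0f} GB "
      f"({int8_gb / fp16_gb:.0%} of original)")
```

Halving the bytes per weight is what makes the model fit on roughly half the GPU memory, leaving more headroom for KV cache and batch size.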

Implementation Details

The model uses symmetric per-channel quantization for weights and symmetric per-token quantization for activations. The GPTQ algorithm is employed for quantization through the llm-compressor library. Only the linear operators within transformer blocks are quantized, preserving the model's core functionality while optimizing resource usage.
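The weight scheme described above can be illustrated in plain NumPy (a simplified sketch of symmetric per-channel quantization; real GPTQ additionally uses calibration data and error compensation, which this omits):

```python
import numpy as np

def quantize_per_channel_symmetric(w, bits=8):
    # Symmetric per-channel quantization: one scale per output channel
    # (row of the weight matrix), with the zero-point fixed at 0.
    qmax = 2 ** (bits - 1) - 1                         # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_per_channel_symmetric(w)
err = np.abs(dequantize(q, s) - w).max()               # bounded by scale/2
```

Because each channel gets its own scale, an outlier in one row cannot blow up the quantization error of the others, which is part of why accuracy retention stays high.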

  • Reduces GPU memory requirements by ~50%
  • Increases matrix-multiply compute throughput by ~2x
  • Maintains near-original accuracy across benchmarks
  • Compatible with vLLM backend for efficient deployment
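Since the checkpoint is vLLM-compatible, deployment can be as simple as pointing vLLM at the Hugging Face model ID (a minimal sketch; the flags shown are illustrative and depend on your hardware):

```shell
# Serve the quantized model behind an OpenAI-compatible API endpoint.
# --tensor-parallel-size is an example value; set it to your GPU count.
vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8 \
  --tensor-parallel-size 2 \
  --max-model-len 4096
```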

Core Capabilities

  • Strong performance in reasoning tasks (94.98% on MATH-500)
  • Excellent coding capabilities (85.80% pass@1 on HumanEval)
  • Robust performance on the OpenLLM v1 benchmark suite (74.50% average score)
  • Efficient RAG and summarization processing

Frequently Asked Questions

Q: What makes this model unique?

The model combines high-performance quantization with minimal accuracy loss, making it particularly suitable for production deployment where resource efficiency is crucial. It achieves this while maintaining performance across a wide range of tasks, from reasoning to coding.

Q: What are the recommended use cases?

The model excels in scenarios requiring efficient deployment, including instruction following, multi-turn chat, code generation, and RAG applications. It's particularly valuable for deployments where memory constraints or inference speed are critical factors.
