# DeepSeek-R1-AWQ
| Property | Value |
|---|---|
| Model Type | Quantized Language Model |
| Author | cognitivecomputations |
| Repository | Hugging Face |
| Maximum Context Length | 65536 tokens |
## What is DeepSeek-R1-AWQ?
DeepSeek-R1-AWQ is an AWQ-quantized version of the DeepSeek R1 model, optimized for efficient inference while maintaining model quality. The implementation includes modified model code that addresses float16 overflow issues, making it more stable and reliable for production deployments.
## Implementation Details
The model has been optimized using AWQ (Activation-aware Weight Quantization) and includes several technical improvements:
- Modified codebase to prevent float16 overflow issues
- Supports deployment on 8x 80GB GPUs
- Achieves 48 tokens per second (TPS) on 8x H100 GPUs
- Delivers 38 TPS on 8x A100 GPUs
- Compatible with vLLM's MLA feature for full context length support (a launch sketch follows this list)
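As a rough illustration of how these details map onto a deployment, below is a minimal sketch using vLLM's offline Python API. It assumes the Hugging Face repo id `cognitivecomputations/DeepSeek-R1-AWQ`; exact flags and MLA behavior vary by vLLM version, so treat the parameter choices as assumptions rather than the author's recommended launch command.

```python
# Minimal vLLM launch sketch (repo id and flag values are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="cognitivecomputations/DeepSeek-R1-AWQ",  # assumed HF repo id
    tensor_parallel_size=8,   # shard across 8x 80GB GPUs as described above
    max_model_len=65536,      # full 65536-token context window
    dtype="float16",          # the modified code guards against fp16 overflow
    trust_remote_code=True,   # load the repository's modified model code
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain AWQ quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism of 8 matches the 8x 80GB configuration cited above; a smaller `max_model_len` would reduce KV-cache pressure if the full context window is not needed.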
## Core Capabilities
- Efficient inference with low batch sizes
- Support for maximum context length of 65536 tokens
- Optimized for multi-GPU deployment
- Enhanced stability with float16 operations
- Integrated vLLM support for improved throughput (see the client sketch after this list)
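When the model is served through vLLM's OpenAI-compatible server, it can be queried with a standard client, as sketched below. The endpoint URL, API key, and served model name are placeholders for whatever your deployment is configured with.

```python
# Hypothetical client sketch: base_url, api_key, and model name are
# placeholders for your vLLM server's configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cognitivecomputations/DeepSeek-R1-AWQ",
    messages=[{"role": "user", "content": "Summarize AWQ quantization."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```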
## Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its optimized quantization approach that maintains performance while enabling efficient deployment on multiple GPUs. The modified codebase specifically addresses float16 overflow issues, making it more reliable for production use.
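For intuition on why the overflow fix matters: float16 tops out at 65504, so any activation above that becomes infinity and corrupts downstream computation. The toy snippet below illustrates the failure mode and a simple clamping mitigation; it is not the model's actual fix.

```python
import numpy as np

# float16 saturates at 65504; larger magnitudes overflow to inf on cast.
x = np.array([60000.0, 70000.0], dtype=np.float32)
print(x.astype(np.float16))  # second value overflows to inf

# One common mitigation: clamp into the fp16 range before casting.
fp16_max = np.finfo(np.float16).max  # 65504.0
print(np.clip(x, -fp16_max, fp16_max).astype(np.float16))  # [60000., 65504.]
```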
Q: What are the recommended use cases?
The model is particularly well-suited for scenarios requiring efficient inference with low batch sizes, especially when working with long context lengths up to 65536 tokens. It's optimized for deployment on high-end GPU clusters, particularly 8x 80GB configurations.