DeepSeek-R1-AWQ

Maintained by: cognitivecomputations

  • Model Type: Quantized Language Model
  • Author: cognitivecomputations
  • Repository: Hugging Face
  • Maximum Context Length: 65536 tokens

What is DeepSeek-R1-AWQ?

DeepSeek-R1-AWQ is an AWQ-quantized version of the DeepSeek R1 model, optimized for efficient inference while preserving output quality. The implementation includes modified model code that addresses float16 overflow issues, making it more stable and reliable for production deployments.

Implementation Details

The model has been optimized using AWQ (Activation-aware Weight Quantization) and includes several technical improvements:

  • Modified codebase to prevent float16 overflow issues
  • Supports deployment on 8x 80GB GPUs
  • Achieves 48 TPS (tokens per second) on 8x H100 GPUs
  • Delivers 38 TPS on 8x A100 GPUs
  • Compatible with vLLM's MLA feature for full context length support (a minimal deployment sketch follows this list)
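
As a rough illustration of such a deployment, the sketch below uses vLLM's offline Python API. The quantization backend, dtype, and other flag choices are assumptions rather than values confirmed by this card, so check the repository's own serving instructions before relying on them.

```python
# Minimal sketch: running the AWQ checkpoint across 8 GPUs with vLLM.
# All flag choices here are illustrative assumptions, not the card's
# confirmed launch configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cognitivecomputations/DeepSeek-R1-AWQ",
    quantization="awq",        # assumed backend; the repo may specify another
    dtype="float16",           # the modified model code targets fp16 stability
    tensor_parallel_size=8,    # shard across 8x 80GB GPUs
    max_model_len=65536,       # full supported context window
    trust_remote_code=True,    # needed for the modified model code
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain activation-aware weight quantization."], params)
print(outputs[0].outputs[0].text)
```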

Core Capabilities

  • Efficient inference with low batch sizes
  • Support for maximum context length of 65536 tokens
  • Optimized for multi-GPU deployment
  • Enhanced stability with float16 operations
  • Integrated vLLM support for improved throughput (see the client example after this list)
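
For online serving, vLLM exposes an OpenAI-compatible endpoint, so a client call against a served instance of this model might look like the following sketch; the base URL, port, and served model name are placeholders, not values from this card.

```python
# Hypothetical client sketch against a vLLM OpenAI-compatible server,
# e.g. one launched with `vllm serve` using 8-way tensor parallelism.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder address

response = client.chat.completions.create(
    model="cognitivecomputations/DeepSeek-R1-AWQ",  # assumed served model name
    messages=[{"role": "user", "content": "Summarize the key ideas of AWQ."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```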

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its optimized quantization approach that maintains performance while enabling efficient deployment on multiple GPUs. The modified codebase specifically addresses float16 overflow issues, making it more reliable for production use.

Q: What are the recommended use cases?

The model is particularly well-suited for scenarios requiring efficient inference with low batch sizes, especially when working with long context lengths up to 65536 tokens. It's optimized for deployment on high-end GPU clusters, particularly 8x 80GB configurations.
