DeepSeek-R1-int2-mixed-sym-inc
| Property | Value |
|---|---|
| Author | OPEA |
| Paper | Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs |
| Model Type | INT2 Quantized Language Model |
| Implementation | Mixed Precision (2/4/16-bit) |
What is DeepSeek-R1-int2-mixed-sym-inc?
DeepSeek-R1-int2-mixed-sym-inc is a quantized version of the DeepSeek-R1 language model built around a mixed-precision scheme with INT2 as the base precision. Weights are quantized symmetrically with a group size of 64, and selected layers fall back to 4-bit or 16-bit precision to preserve accuracy while substantially reducing model size.
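A minimal loading sketch is shown below. The Hub repository id, device settings, and generation parameters are illustrative assumptions, not details taken from this card.

```python
# Minimal loading sketch. The repo id and runtime options below are
# assumptions for illustration, not values stated on this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OPEA/DeepSeek-R1-int2-mixed-sym-inc"  # hypothetical Hub path

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # set to "cpu" to use the CPU path discussed below
    torch_dtype="auto",
    trust_remote_code=True,  # DeepSeek models typically ship custom modeling code
)

prompt = "Explain INT2 weight quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```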
Implementation Details
The model follows a mixed-precision quantization strategy: most layers use 2-bit precision, while selected layers fall back to 4-bit or 16-bit precision. This keeps model quality close to the original while achieving substantial compression. The model can be deployed on both CPU and CUDA devices, with the CPU path potentially yielding better accuracy due to overflow protection. A minimal sketch of the symmetric group-wise scheme follows the feature list below.
- Utilizes INT2 quantization with group size 64
- Implements symmetric quantization for weight representation
- Strategic fallback to 4-bit and 16-bit precision for critical layers
- Supports both CPU and CUDA deployment options
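As a rough illustration of the scheme described above, the sketch below applies plain symmetric, group-wise round-to-nearest quantization with a group size of 64. It is not the model's actual recipe: the referenced paper optimizes the rounding itself via signed gradient descent, and real kernels store packed integer weights rather than int8 tensors.

```python
import torch

def quantize_symmetric(weight: torch.Tensor, bits: int = 2, group_size: int = 64):
    """Symmetric group-wise round-to-nearest quantization (one common convention,
    not necessarily the exact recipe used for this model)."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    # Split each row into contiguous groups of `group_size` weights.
    w = weight.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit, 7 for 4-bit
    # One scale per group, derived from the group's largest magnitude.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    # Reverse the mapping: multiply by the per-group scale and restore the shape.
    return (q.float() * scale).reshape(shape)

w = torch.randn(128, 256)
q, s = quantize_symmetric(w, bits=2, group_size=64)
w_hat = dequantize(q, s, w.shape)
print((w - w_hat).abs().mean())  # mean round-trip error
```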
Core Capabilities
- Maintains strong performance on MMLU (0.8302 vs 0.8514 for BF16); an evaluation sketch follows this list
- Competitive accuracy on ARC-Challenge (0.6084 vs 0.6212)
- Effective on commonsense reasoning tasks such as HellaSwag and WinoGrande
- Significant model size reduction while preserving functionality
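Scores like the ones above can in principle be reproduced with a standard evaluation harness. The sketch below assumes EleutherAI's lm-evaluation-harness and a hypothetical Hub repo id; the card does not state which harness, tasks, or settings produced the figures, so treat every parameter here as an assumption (and note that a model of this size requires a correspondingly large multi-GPU setup).

```python
# Hedged reproduction sketch using lm-evaluation-harness (>= 0.4).
# Repo id, task list, and batch size are assumptions, not values from this card.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=OPEA/DeepSeek-R1-int2-mixed-sym-inc,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "hellaswag", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```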
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its innovative mixed-precision quantization approach, achieving extreme compression to INT2 while strategically preserving higher precision where needed. It demonstrates that aggressive quantization can maintain strong performance when properly implemented.
Q: What are the recommended use cases?
The model is well-suited for deployment scenarios where model size is a critical constraint but performance cannot be significantly compromised. It's particularly effective for general language understanding tasks, showing strong performance on benchmarks like MMLU and ARC.