DeepSeek-R1-int2-mixed-sym-inc
| Property | Value |
|---|---|
| Author | OPEA |
| Paper | Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs |
| Model Type | INT2 Quantized Language Model |
| Implementation | Mixed Precision (2/4/16-bit) |
What is DeepSeek-R1-int2-mixed-sym-inc?
DeepSeek-R1-int2-mixed-sym-inc is a quantized version of the DeepSeek-R1 language model built around a mixed-precision scheme with INT2 as the base precision. Weights are quantized symmetrically with a group size of 64, and selected layers fall back to 4-bit or 16-bit precision to preserve accuracy while substantially reducing model size.
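A minimal loading sketch is shown below. The Hub repository id, device settings, and generation parameters are illustrative assumptions, not details taken from this card.

```python
# Minimal loading sketch. The repo id and runtime options below are
# assumptions for illustration, not values stated on this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OPEA/DeepSeek-R1-int2-mixed-sym-inc"  # hypothetical Hub path

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # set to "cpu" to use the CPU path discussed below
    torch_dtype="auto",
    trust_remote_code=True,  # DeepSeek models typically ship custom modeling code
)

prompt = "Explain INT2 weight quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```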
Implementation Details
The model follows a mixed-precision quantization strategy: most layers use 2-bit precision, while selected layers fall back to 4-bit or 16-bit precision. This keeps model quality close to the original while achieving substantial compression. The model can be deployed on both CPU and CUDA devices, with the CPU path potentially yielding better accuracy due to overflow protection. A minimal sketch of the symmetric group-wise scheme follows the feature list below.
- Utilizes INT2 quantization with group size 64
- Implements symmetric quantization for weight representation
- Strategic fallback to 4-bit and 16-bit precision for critical layers
- Supports both CPU and CUDA deployment options
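As a rough illustration of the scheme described above, the sketch below applies plain symmetric, group-wise round-to-nearest quantization with a group size of 64. It is not the model's actual recipe: the referenced paper optimizes the rounding itself via signed gradient descent, and real kernels store packed integer weights rather than int8 tensors.

```python
import torch

def quantize_symmetric(weight: torch.Tensor, bits: int = 2, group_size: int = 64):
    """Symmetric group-wise round-to-nearest quantization (one common convention,
    not necessarily the exact recipe used for this model)."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    # Split each row into contiguous groups of `group_size` weights.
    w = weight.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit, 7 for 4-bit
    # One scale per group, derived from the group's largest magnitude.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    # Reverse the mapping: multiply by the per-group scale and restore the shape.
    return (q.float() * scale).reshape(shape)

w = torch.randn(128, 256)
q, s = quantize_symmetric(w, bits=2, group_size=64)
w_hat = dequantize(q, s, w.shape)
print((w - w_hat).abs().mean())  # mean round-trip error
```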
Core Capabilities
- Maintains strong performance on MMLU (0.8302 vs 0.8514 for BF16); an evaluation sketch follows this list
- Competitive accuracy on ARC-Challenge (0.6084 vs 0.6212)
- Effective on commonsense reasoning tasks such as HellaSwag and WinoGrande
- Significant model size reduction while preserving functionality
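Scores like the ones above can in principle be reproduced with a standard evaluation harness. The sketch below assumes EleutherAI's lm-evaluation-harness and a hypothetical Hub repo id; the card does not state which harness, tasks, or settings produced the figures, so treat every parameter here as an assumption (and note that a model of this size requires a correspondingly large multi-GPU setup).

```python
# Hedged reproduction sketch using lm-evaluation-harness (>= 0.4).
# Repo id, task list, and batch size are assumptions, not values from this card.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=OPEA/DeepSeek-R1-int2-mixed-sym-inc,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "hellaswag", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```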
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its innovative mixed-precision quantization approach, achieving extreme compression to INT2 while strategically preserving higher precision where needed. It demonstrates that aggressive quantization can maintain strong performance when properly implemented.
Q: What are the recommended use cases?
The model is well-suited for deployment scenarios where model size is a critical constraint but performance cannot be significantly compromised. It's particularly effective for general language understanding tasks, showing strong performance on benchmarks like MMLU and ARC.