Qwen-7B-Chat-Int4

Qwen-7B-Chat-Int4 is a 4-bit quantized version of the Qwen-7B-Chat model; listed at 2.11B parameters, it offers efficient, low-memory inference while maintaining strong performance across multiple languages and tasks.

| Property | Value |
|---|---|
| Parameter Count | 2.11B parameters |
| Model Type | Quantized Chat Model |
| Architecture | 32 layers, 32 heads, 4096 d_model |
| License | Tongyi Qianwen License Agreement |
| Supported Languages | Chinese, English, Multi-lingual |

What is Qwen-7B-Chat-Int4?

Qwen-7B-Chat-Int4 is a 4-bit quantized version of the Qwen-7B-Chat model, designed for efficient deployment while maintaining impressive performance. The model is built on a Transformer architecture and has been trained on diverse datasets including web texts, professional books, and code repositories. This quantized version significantly reduces memory usage while preserving most of the original model's capabilities.
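The memory savings from 4-bit quantization can be sketched with simple arithmetic. The snippet below is an illustration, not a measurement: it assumes a weight count on the order of 7.7B for the full Qwen-7B model (an approximate figure) and counts raw weight storage only, ignoring the per-group scales and zero-points that real quantized checkpoints also carry.

```python
def weight_storage_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights at a given precision, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1024**3

# Approximate weight count for the full-precision Qwen-7B model.
N = 7.7e9
fp16 = weight_storage_gib(N, 16)  # roughly 14.3 GiB
int4 = weight_storage_gib(N, 4)   # roughly 3.6 GiB
print(f"fp16: {fp16:.1f} GiB, int4: {int4:.1f} GiB ({fp16 / int4:.0f}x smaller)")
```

In practice the on-disk and in-memory footprints are somewhat larger than the int4 figure, since quantization metadata, embeddings kept at higher precision, and activation/KV-cache memory all add overhead.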

Implementation Details

The model implements advanced technical features including RoPE relative position encoding, SwiGLU activation functions, and RMSNorm. It uses a vocabulary of approximately 150K tokens optimized for Chinese, English, and code, built on the cl100k_base BPE vocabulary (the encoding used by GPT-4) and extended from there.

  • Architecture: 32 layers, 32 attention heads, 4096 model dimension (d_model)
  • Context Length: 8192 tokens
  • Memory Usage: 8.21GB for encoding 2048 tokens
  • Inference Speed: 50.09 tokens/s for 2048 tokens with Flash Attention v2
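Of the features listed above, RMSNorm is simple enough to sketch in a few lines. This is an illustrative pure-Python version, not the model's actual implementation: unlike LayerNorm, it does not subtract the mean or add a bias, only rescaling by the root-mean-square of the vector.

```python
import math

def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: rescale by the root-mean-square of x, then apply a learned gain.

    No mean-centering and no bias term, which is what distinguishes it
    from standard LayerNorm.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

h = [1.0, -2.0, 3.0, -4.0]
normed = rms_norm(h, [1.0] * len(h))  # mean of squares is ~1 after normalization
```

In the real model this runs per hidden vector (d_model = 4096) with a learned `weight` gain, typically before each attention and MLP block.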

Core Capabilities

  • Strong performance in Chinese (59.7% on C-Eval) and English (55.8% on MMLU) evaluations
  • Code generation capabilities with 37.2% Pass@1 on HumanEval
  • Mathematical reasoning with 50.3% accuracy on GSM8K
  • Tool usage and ReAct prompting support
  • Efficient inference with reduced memory footprint
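The ReAct prompting support mentioned above interleaves Thought/Action/Observation steps so the model can call external tools. The sketch below builds a generic ReAct-style prompt; the template text and the `build_react_prompt` helper are illustrative assumptions, not the exact format shipped in the Qwen repository.

```python
REACT_TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:

{tool_descs}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {query}"""

def build_react_prompt(query: str, tools: list[tuple[str, str]]) -> str:
    """Fill the ReAct template from (name, description) tool pairs."""
    tool_descs = "\n".join(f"{name}: {desc}" for name, desc in tools)
    tool_names = ", ".join(name for name, _ in tools)
    return REACT_TEMPLATE.format(tool_descs=tool_descs,
                                 tool_names=tool_names, query=query)

prompt = build_react_prompt("What is the weather in Beijing?",
                            [("search", "search the web for current information")])
```

At inference time the caller generates until an `Observation:` line, executes the named tool with the given input, appends the result, and resumes generation.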

Frequently Asked Questions

Q: What makes this model unique?

The model combines efficient 4-bit quantization with strong multi-lingual capabilities and tool usage abilities, making it particularly suitable for deployment in resource-constrained environments while maintaining high performance.
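The storage side of 4-bit quantization comes down to bit-packing: two 4-bit values fit in one byte. The round-trip below is a minimal sketch of that idea only; real GPTQ-style checkpoints additionally store per-group scales and zero-points used to dequantize the packed integers back to floats.

```python
def pack_int4(values: list[int]) -> bytes:
    """Pack an even-length list of 4-bit unsigned values (0..15) into bytes."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack_int4(packed: bytes) -> list[int]:
    """Recover the original 4-bit values, two per byte."""
    out: list[int] = []
    for b in packed:
        out.extend(((b >> 4) & 0xF, b & 0xF))
    return out

vals = [3, 15, 0, 7, 9, 1]
assert unpack_int4(pack_int4(vals)) == vals  # lossless round-trip
```

Packing halves-of-bytes like this is why the weight tensors take roughly a quarter of the fp16 footprint.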

Q: What are the recommended use cases?

The model excels in multi-lingual chat applications, code generation, mathematical problem-solving, and tool-augmented tasks. It's particularly suitable for deployment scenarios where memory efficiency is crucial.
