Yarn-Mistral-7B-128k-AWQ
| Property | Value |
|---|---|
| Parameter Count | 7 billion |
| Context Length | 128,000 tokens |
| License | Apache 2.0 |
| Paper | arXiv:2309.00071 |
| Quantization | 4-bit AWQ |
What is Yarn-Mistral-7B-128k-AWQ?
Yarn-Mistral-7B-128k-AWQ is a quantized version of the YaRN-extended Mistral language model, optimized for efficient inference with minimal loss in output quality. The model retains a 128k-token context window, making it well suited to processing long documents and extended, multi-turn conversations. AWQ quantization reduces the weights to roughly 4.15 GB while preserving quality comparable to higher-precision versions.
Implementation Details
The model uses AWQ (Activation-aware Weight Quantization), operating at 4-bit precision with a group size of 128. It is compatible with major inference frameworks including Text Generation WebUI, vLLM, and Hugging Face's Text Generation Inference (TGI); a minimal loading sketch follows the specification list below.
- Quantization Method: 4-bit AWQ
- Context Length: 128k tokens
- Base Architecture: Mistral-7B
- Model Size: 4.15 GB
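As a minimal loading sketch (not an official snippet from the model card): the Hugging Face repo id below is an assumption, loading AWQ weights through transformers requires the autoawq package, and older transformers releases may need `trust_remote_code=True` for the YaRN rotary-embedding code.

```python
# Minimal sketch: loading the AWQ weights through transformers + autoawq.
# The repo id is an assumption; point it at wherever the weights are hosted.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TheBloke/Yarn-Mistral-7B-128k-AWQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",          # place the ~4 GB of weights on the GPU
    trust_remote_code=True,     # may be needed for the YaRN rotary embedding
)

prompt = "Summarize the following report:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```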
Core Capabilities
- Long-form text generation with extended context awareness
- Efficient inference with minimal quality degradation
- Improved perplexity across a range of context lengths (a quick measurement sketch follows this list)
- Maintains competitive performance on standard benchmarks (ARC-c, HellaSwag, MMLU)
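The perplexity claim can be sanity-checked locally. The sketch below assumes the model and tokenizer were loaded as in the previous snippet and truncates the input to a modest length so it fits in memory; the file name is a placeholder.

```python
# Rough sketch: one-shot perplexity over (a prefix of) a long document.
# Assumes `model` and `tokenizer` from the loading snippet above.
import torch

def perplexity(text: str, max_tokens: int = 8192) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens]
    ids = ids.to(model.device)
    with torch.no_grad():
        # Next-token cross-entropy; exponentiate to get perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity(open("long_report.txt").read()))
```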
Frequently Asked Questions
Q: What makes this model unique?
This model combines the Mistral architecture with the YaRN context-extension method for long-context processing, and uses AWQ quantization for efficient deployment. It reports low perplexity across context lengths (2.19 at 128k) while maintaining a small deployment footprint.
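For intuition only, the snippet below illustrates plain linear position interpolation, the simpler scheme that YaRN refines (YaRN additionally rescales rotary frequencies unevenly and adjusts attention scaling; see arXiv:2309.00071 for the full method). The 8k base context, 16x factor, and RoPE constants are assumptions, not values taken from this model's configuration.

```python
# Rough intuition: rescaling positions by the extension factor keeps rotary
# angles inside the range the base model saw during pretraining. This is
# plain linear interpolation, NOT the full YaRN scheme.
import numpy as np

base_ctx, target_ctx = 8_192, 131_072        # assumed base and target windows
factor = target_ctx / base_ctx               # 16x extension

dim, theta = 128, 10_000.0                   # typical RoPE head dim and base
inv_freq = 1.0 / theta ** (np.arange(0, dim, 2) / dim)

pos = 100_000                                # a position far beyond base_ctx
naive_angles = pos * inv_freq                # outside the trained range
interp_angles = (pos / factor) * inv_freq    # squeezed back into it
```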
Q: What are the recommended use cases?
The model excels at tasks requiring long context understanding, such as document analysis, extended conversations, and complex text generation. It's particularly suitable for deployment in resource-constrained environments where efficient inference is crucial.
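For deployment, here is a hedged sketch of offline batch inference with vLLM, one of the frameworks listed above. The repo id and `max_model_len` value are assumptions; serving the full 128k window requires far more GPU memory for the KV cache than the 4-bit weights themselves need.

```python
# Sketch: offline long-context generation with vLLM. The repo id is assumed;
# max_model_len is capped below 128k to keep the KV cache within a single GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Yarn-Mistral-7B-128k-AWQ",  # assumed repo id
    quantization="awq",
    max_model_len=32_768,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
long_document = open("long_report.txt").read()  # placeholder input file
outputs = llm.generate([f"{long_document}\n\nSummarize the report above:"], params)
print(outputs[0].outputs[0].text)
```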