Yarn-Mistral-7B-128k-AWQ
| Property | Value |
|---|---|
| Parameter Count | 7 billion |
| Context Length | 128,000 tokens |
| License | Apache 2.0 |
| Paper | arXiv:2309.00071 |
| Quantization | 4-bit AWQ |
What is Yarn-Mistral-7B-128k-AWQ?
Yarn-Mistral-7B-128k-AWQ is a quantized version of the YaRN-extended Mistral language model, optimized for efficient inference with minimal loss in output quality. The model retains a 128k-token context window, making it well suited to processing long documents and extended, multi-turn conversations. AWQ quantization reduces the weights to roughly 4.15 GB while preserving quality comparable to higher-precision versions.
Implementation Details
The model uses AWQ (Activation-aware Weight Quantization), operating at 4-bit precision with a group size of 128. It is compatible with major inference frameworks including Text Generation WebUI, vLLM, and Hugging Face's Text Generation Inference (TGI); a minimal loading sketch follows the specification list below.
- Quantization Method: 4-bit AWQ
- Context Length: 128k tokens
- Base Architecture: Mistral-7B
- Model Size: 4.15 GB
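As a minimal loading sketch (not an official snippet from the model card): the Hugging Face repo id below is an assumption, loading AWQ weights through transformers requires the autoawq package, and older transformers releases may need `trust_remote_code=True` for the YaRN rotary-embedding code.

```python
# Minimal sketch: loading the AWQ weights through transformers + autoawq.
# The repo id is an assumption; point it at wherever the weights are hosted.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TheBloke/Yarn-Mistral-7B-128k-AWQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",          # place the ~4 GB of weights on the GPU
    trust_remote_code=True,     # may be needed for the YaRN rotary embedding
)

prompt = "Summarize the following report:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```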
Core Capabilities
- Long-form text generation with extended context awareness
- Efficient inference with minimal quality degradation
- Improved perplexity across a range of context lengths (a quick measurement sketch follows this list)
- Maintains competitive performance on standard benchmarks (ARC-c, HellaSwag, MMLU)
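The perplexity claim can be sanity-checked locally. The sketch below assumes the model and tokenizer were loaded as in the previous snippet and truncates the input to a modest length so it fits in memory; the file name is a placeholder.

```python
# Rough sketch: one-shot perplexity over (a prefix of) a long document.
# Assumes `model` and `tokenizer` from the loading snippet above.
import torch

def perplexity(text: str, max_tokens: int = 8192) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens]
    ids = ids.to(model.device)
    with torch.no_grad():
        # Next-token cross-entropy; exponentiate to get perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity(open("long_report.txt").read()))
```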
Frequently Asked Questions
Q: What makes this model unique?
This model combines the Mistral architecture with the YaRN context-extension method for long-context processing, and uses AWQ quantization for efficient deployment. It reports low perplexity across context lengths (2.19 at 128k) while maintaining a small deployment footprint.
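For intuition only, the snippet below illustrates plain linear position interpolation, the simpler scheme that YaRN refines (YaRN additionally rescales rotary frequencies unevenly and adjusts attention scaling; see arXiv:2309.00071 for the full method). The 8k base context, 16x factor, and RoPE constants are assumptions, not values taken from this model's configuration.

```python
# Rough intuition: rescaling positions by the extension factor keeps rotary
# angles inside the range the base model saw during pretraining. This is
# plain linear interpolation, NOT the full YaRN scheme.
import numpy as np

base_ctx, target_ctx = 8_192, 131_072        # assumed base and target windows
factor = target_ctx / base_ctx               # 16x extension

dim, theta = 128, 10_000.0                   # typical RoPE head dim and base
inv_freq = 1.0 / theta ** (np.arange(0, dim, 2) / dim)

pos = 100_000                                # a position far beyond base_ctx
naive_angles = pos * inv_freq                # outside the trained range
interp_angles = (pos / factor) * inv_freq    # squeezed back into it
```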
Q: What are the recommended use cases?
The model excels at tasks requiring long context understanding, such as document analysis, extended conversations, and complex text generation. It's particularly suitable for deployment in resource-constrained environments where efficient inference is crucial.
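For deployment, here is a hedged sketch of offline batch inference with vLLM, one of the frameworks listed above. The repo id and `max_model_len` value are assumptions; serving the full 128k window requires far more GPU memory for the KV cache than the 4-bit weights themselves need.

```python
# Sketch: offline long-context generation with vLLM. The repo id is assumed;
# max_model_len is capped below 128k to keep the KV cache within a single GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Yarn-Mistral-7B-128k-AWQ",  # assumed repo id
    quantization="awq",
    max_model_len=32_768,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
long_document = open("long_report.txt").read()  # placeholder input file
outputs = llm.generate([f"{long_document}\n\nSummarize the report above:"], params)
print(outputs[0].outputs[0].text)
```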