Wizard Vicuna 13B GGML
| Property | Value |
|---|---|
| Author | TheBloke |
| License | Other |
| Base Model | June Lee's Wizard Vicuna 13B |
| Format | GGML (CPU+GPU inference) |
What is wizard-vicuna-13B-GGML?
Wizard Vicuna 13B GGML is a quantized version of June Lee's Wizard Vicuna model, specifically optimized for CPU and GPU inference using llama.cpp. It comes in multiple quantization levels ranging from 2-bit to 8-bit, offering different trade-offs between model size, memory usage, and inference speed.
Implementation Details
The model is available in a range of quantization formats, covering both the original llama.cpp methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K). File sizes range from 5.43GB (q2_K) to 13.83GB (q8_0), with RAM requirements scaling accordingly.
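As a minimal usage sketch, the snippet below loads one of the quantized files with llama-cpp-python. The filename is illustrative (check the repository for exact file names), and because later llama.cpp releases replaced GGML with GGUF, a GGML-era llama-cpp-python build (roughly v0.1.78 or earlier) is assumed:

```python
from llama_cpp import Llama  # GGML-era llama-cpp-python build assumed

# Illustrative filename -- pick the quantization level that fits your RAM budget.
llm = Llama(
    model_path="./wizard-vicuna-13B.ggmlv3.q4_0.bin",
    n_ctx=2048,    # the model's supported context window
    n_threads=8,   # tune to your CPU core count
)

# Vicuna-style USER/ASSISTANT prompt format (an assumption; verify against
# the prompt template in the model card).
prompt = "USER: Explain GGML quantization in one sentence. ASSISTANT:"
output = llm(prompt, max_tokens=128, temperature=0.7, stop=["USER:"])
print(output["choices"][0]["text"])
```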
- Supports multiple inference frameworks including text-generation-webui, KoboldCpp, and llama-cpp-python
- Offers GPU layer offloading to reduce RAM usage (see the sketch after this list)
- Compatible with latest llama.cpp developments
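The sketch below illustrates GPU layer offloading, assuming a GPU-enabled llama-cpp-python build (e.g. compiled with cuBLAS); the layer count is an illustrative starting point, not a recommendation:

```python
from llama_cpp import Llama

# Offload a portion of the 13B model's 40 transformer layers to the GPU;
# the rest stay in system RAM. Raise n_gpu_layers until VRAM runs out, or
# lower it if loading fails. 32 is only an illustrative value.
llm = Llama(
    model_path="./wizard-vicuna-13B.ggmlv3.q4_K_M.bin",  # illustrative filename
    n_ctx=2048,
    n_gpu_layers=32,
)
```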
Core Capabilities
- Efficient CPU+GPU inference with various memory/performance trade-offs
- Multiple quantization options for different use-cases
- Support for a 2048-token context window (a token-budgeting sketch follows this list)
- Integration with popular inference frameworks
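Because the context window is fixed at 2048 tokens, prompt tokens plus requested completion tokens must fit inside it. A small budgeting check, under the same illustrative filename and library assumptions as above:

```python
from llama_cpp import Llama

MAX_CTX = 2048
MAX_NEW_TOKENS = 256

llm = Llama(model_path="./wizard-vicuna-13B.ggmlv3.q4_0.bin", n_ctx=MAX_CTX)

prompt = "USER: Summarize the trade-offs between q2_K and q8_0. ASSISTANT:"

# tokenize() counts tokens using the model's own vocabulary.
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
if n_prompt + MAX_NEW_TOKENS > MAX_CTX:
    raise ValueError(
        f"{n_prompt} prompt tokens + {MAX_NEW_TOKENS} new tokens "
        f"exceed the {MAX_CTX}-token context window"
    )
```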
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its variety of quantization options, allowing users to choose the optimal balance between model size, memory usage, and inference speed for their specific hardware setup.
Q: What are the recommended use cases?
The model is ideal for users who need to run large language models on consumer hardware, particularly those with limited GPU memory or those who prefer CPU inference. The different quantization options make it versatile for various hardware configurations.