Wizard Vicuna 13B GGML
| Property | Value |
|---|---|
| Author | TheBloke |
| License | Other |
| Base Model | June Lee's Wizard Vicuna 13B |
| Format | GGML (CPU+GPU inference) |
What is wizard-vicuna-13B-GGML?
Wizard Vicuna 13B GGML is a quantized version of June Lee's Wizard Vicuna model, specifically optimized for CPU and GPU inference using llama.cpp. It comes in multiple quantization levels ranging from 2-bit to 8-bit, offering different trade-offs between model size, memory usage, and inference speed.
Implementation Details
The model is available in a range of quantization formats, covering both the original llama.cpp methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K). File sizes range from 5.43GB (q2_K) to 13.83GB (q8_0), with RAM requirements scaling accordingly.
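As a minimal usage sketch, the snippet below loads one of the quantized files with llama-cpp-python. The filename is illustrative (check the repository for exact file names), and because later llama.cpp releases replaced GGML with GGUF, a GGML-era llama-cpp-python build (roughly v0.1.78 or earlier) is assumed:

```python
from llama_cpp import Llama  # GGML-era llama-cpp-python build assumed

# Illustrative filename -- pick the quantization level that fits your RAM budget.
llm = Llama(
    model_path="./wizard-vicuna-13B.ggmlv3.q4_0.bin",
    n_ctx=2048,    # the model's supported context window
    n_threads=8,   # tune to your CPU core count
)

# Vicuna-style USER/ASSISTANT prompt format (an assumption; verify against
# the prompt template in the model card).
prompt = "USER: Explain GGML quantization in one sentence. ASSISTANT:"
output = llm(prompt, max_tokens=128, temperature=0.7, stop=["USER:"])
print(output["choices"][0]["text"])
```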
- Supports multiple inference frameworks including text-generation-webui, KoboldCpp, and llama-cpp-python
- Offers GPU layer offloading to reduce RAM usage (see the sketch after this list)
- Compatible with latest llama.cpp developments
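The sketch below illustrates GPU layer offloading, assuming a GPU-enabled llama-cpp-python build (e.g. compiled with cuBLAS); the layer count is an illustrative starting point, not a recommendation:

```python
from llama_cpp import Llama

# Offload a portion of the 13B model's 40 transformer layers to the GPU;
# the rest stay in system RAM. Raise n_gpu_layers until VRAM runs out, or
# lower it if loading fails. 32 is only an illustrative value.
llm = Llama(
    model_path="./wizard-vicuna-13B.ggmlv3.q4_K_M.bin",  # illustrative filename
    n_ctx=2048,
    n_gpu_layers=32,
)
```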
Core Capabilities
- Efficient CPU+GPU inference with various memory/performance trade-offs
- Multiple quantization options for different use-cases
- Support for a 2048-token context window (a token-budgeting sketch follows this list)
- Integration with popular inference frameworks
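Because the context window is fixed at 2048 tokens, prompt tokens plus requested completion tokens must fit inside it. A small budgeting check, under the same illustrative filename and library assumptions as above:

```python
from llama_cpp import Llama

MAX_CTX = 2048
MAX_NEW_TOKENS = 256

llm = Llama(model_path="./wizard-vicuna-13B.ggmlv3.q4_0.bin", n_ctx=MAX_CTX)

prompt = "USER: Summarize the trade-offs between q2_K and q8_0. ASSISTANT:"

# tokenize() counts tokens using the model's own vocabulary.
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
if n_prompt + MAX_NEW_TOKENS > MAX_CTX:
    raise ValueError(
        f"{n_prompt} prompt tokens + {MAX_NEW_TOKENS} new tokens "
        f"exceed the {MAX_CTX}-token context window"
    )
```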
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its variety of quantization options, allowing users to choose the optimal balance between model size, memory usage, and inference speed for their specific hardware setup.
Q: What are the recommended use cases?
The model is ideal for users who need to run large language models on consumer hardware, particularly those with limited GPU memory or those who prefer CPU inference. The different quantization options make it versatile for various hardware configurations.