wizard-vicuna-13B-GGML

Wizard Vicuna 13B GGML

Property      Value
------------  ----------------------------
Author        TheBloke
License       Other
Base Model    June Lee's Wizard Vicuna 13B
Format        GGML (CPU+GPU inference)

What is wizard-vicuna-13B-GGML?

Wizard Vicuna 13B GGML is a quantized version of June Lee's Wizard Vicuna model, specifically optimized for CPU and GPU inference using llama.cpp. It comes in multiple quantization levels ranging from 2-bit to 8-bit, offering different trade-offs between model size, memory usage, and inference speed.
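
A minimal sketch of loading one of these files with llama-cpp-python, assuming a pre-GGUF release of that library (later versions read only GGUF) and a hypothetical local filename:

    # Minimal sketch: load a GGML quantization with llama-cpp-python and generate text.
    # Assumes an older llama-cpp-python release that still reads GGML files;
    # the local filename is an assumption.
    from llama_cpp import Llama

    llm = Llama(
        model_path="wizard-vicuna-13B.ggmlv3.q4_K_M.bin",  # assumed filename
        n_ctx=2048,  # the model's 2048-token context window
    )
    output = llm("USER: What is quantization?\nASSISTANT:", max_tokens=128)
    print(output["choices"][0]["text"])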

Implementation Details

The model is available in a range of quantization formats, including both the original llama.cpp methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K). File sizes range from 5.43 GB (q2_K) to 13.83 GB (q8_0), with RAM requirements scaling accordingly.
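
Picking a file is mostly a question of available memory. A back-of-the-envelope helper for that choice (only the two file sizes stated above are used; the ~2 GB runtime overhead is an assumed ballpark, not a figure from this card):

    # Rough helper for choosing a quantization by available RAM.
    QUANT_SIZES_GB = {
        "q2_K": 5.43,   # smallest file, lowest quality
        "q8_0": 13.83,  # largest file, closest to full precision
    }
    OVERHEAD_GB = 2.0  # assumed scratch/context overhead, not a measured value

    def fits_in_ram(quant: str, ram_gb: float) -> bool:
        """True if the given quantization plausibly fits in ram_gb of memory."""
        return QUANT_SIZES_GB[quant] + OVERHEAD_GB <= ram_gb

    print(fits_in_ram("q2_K", 8.0))   # True  -> workable on an 8 GB machine
    print(fits_in_ram("q8_0", 8.0))   # False -> wants roughly 16 GB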

  • Supports multiple inference frameworks, including text-generation-webui, KoboldCpp, and llama-cpp-python
  • Offers GPU layer offloading to reduce RAM usage (see the sketch after this list)
  • Compatible with recent llama.cpp builds (the newer k-quant files require an up-to-date build)
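
A sketch of what that offloading looks like through llama-cpp-python; it requires a build with GPU support (e.g. cuBLAS), and the filename and layer count here are assumptions to tune for your hardware:

    # Sketch: keep part of the model on the GPU to reduce system RAM usage.
    from llama_cpp import Llama

    llm = Llama(
        model_path="wizard-vicuna-13B.ggmlv3.q4_K_M.bin",  # assumed filename
        n_ctx=2048,
        n_gpu_layers=32,  # layers offloaded to the GPU; 0 = pure CPU inference
    )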

Core Capabilities

  • Efficient CPU+GPU inference with various memory/performance trade-offs
  • Multiple quantization options for different use-cases
  • Supports a 2048-token context window
  • Integration with popular inference frameworks
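
For completeness, one way to fetch a single quantization file from the Hugging Face repo (the repo id matches this card; the exact filename is assumed to follow TheBloke's usual ggmlv3 naming):

    # Sketch: download one quantization file with huggingface_hub.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/wizard-vicuna-13B-GGML",
        filename="wizard-vicuna-13B.ggmlv3.q4_K_M.bin",  # assumed filename
    )
    print(path)  # local cache path of the downloaded file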

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its variety of quantization options, allowing users to choose the optimal balance between model size, memory usage, and inference speed for their specific hardware setup.

Q: What are the recommended use cases?

The model is ideal for users who need to run large language models on consumer hardware, particularly those with limited GPU memory or those who prefer CPU inference. The different quantization options make it versatile for various hardware configurations.
