# Guanaco-65B-GGML
| Property | Value |
|---|---|
| Base Model | LLaMA 65B |
| License | Other (Apache 2 for adapters) |
| Paper | QLoRA Paper |
| Quantization Options | 2-bit to 8-bit GGML |
## What is guanaco-65B-GGML?
Guanaco-65B-GGML is a quantized version of the Guanaco language model, specifically optimized for CPU and GPU inference using the GGML framework. Based on the LLaMA architecture and fine-tuned using QLoRA on the OASST1 dataset, this model offers various quantization options from 2-bit to 8-bit precision to balance performance and resource usage.
## Implementation Details
The model comes in multiple quantization variants, ranging from a lightweight 2-bit version (27.33GB) to full 8-bit precision (69.37GB). It offers both the traditional llama.cpp quantization methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods, which generally deliver better quality at a given file size.
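The relationship between quantization level and file size can be sketched roughly as follows. The bits-per-weight figures are approximations of the effective rates for GGML formats (they include per-block scales and other metadata, which is why "2-bit" q2_K is well above 2 bits per weight), so this is an illustrative estimate, not an exact size calculator.

```python
# Approximate effective bits per weight for a few GGML quant formats.
# These include per-block scale/metadata overhead, so the estimates
# only loosely track the file sizes listed above.
APPROX_BITS_PER_WEIGHT = {
    "q2_K": 3.35,  # k-quant mix, not pure 2-bit
    "q4_0": 4.5,
    "q5_0": 5.5,
    "q8_0": 8.5,
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk size of a quantized model in gigabytes."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9

for q in ("q2_K", "q4_0", "q8_0"):
    print(f"{q}: ~{estimated_size_gb(65e9, q):.1f} GB")
```

For the 65B model this lands close to the quoted figures (roughly 27GB for q2_K and 69GB for q8_0), which is why a quant level is chosen mostly by how much RAM and VRAM are available.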
- Supports multiple quantization levels (q2_K through q8_0)
- Compatible with llama.cpp and various UI frameworks
- Includes GPU layer offloading capabilities
- Requires specific prompt template: "### Human: [prompt] ### Assistant:"
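The prompt template above can be wrapped in a small helper. The single-turn format comes from the model card; extending it to multiple turns by concatenating earlier exchanges is an assumption, not something the card specifies.

```python
def build_prompt(turns, next_user_message):
    """Format a conversation with the Guanaco prompt template.

    `turns` is a list of (human, assistant) pairs from earlier turns;
    multi-turn concatenation is an assumed extension of the documented
    single-turn "### Human: ... ### Assistant:" format.
    """
    parts = [
        f"### Human: {human} ### Assistant: {assistant}"
        for human, assistant in turns
    ]
    parts.append(f"### Human: {next_user_message} ### Assistant:")
    return " ".join(parts)

print(build_prompt([], "What is quantization?"))
```

The model continues generating after the trailing `### Assistant:` marker, so the prompt must end exactly there.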
## Core Capabilities
- Competitive performance with commercial chatbot systems
- Multilingual response capabilities
- Efficient CPU+GPU inference with adjustable resource usage
- Supports context window of 2048 tokens
- Performs well on MMLU benchmark (62.2% accuracy)
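Putting these together, a typical llama.cpp invocation might look like the following. The model filename and the number of offloaded layers are illustrative, and GGML-format files require a llama.cpp build from before the switch to GGUF.

```shell
# Run the q4_0 variant with a 2048-token context and partial GPU offload.
# -ngl sets how many layers are offloaded to the GPU; tune it to fit
# your VRAM. The filename is an example, not a guaranteed path.
./main -m ./guanaco-65B.ggmlv3.q4_0.bin \
  -c 2048 \
  -ngl 40 \
  -p "### Human: Write a haiku about quantization. ### Assistant:"
```

Setting `-ngl 0` runs entirely on CPU; raising it shifts more of the work to the GPU until VRAM runs out.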
## Frequently Asked Questions
Q: What makes this model unique?
This model offers a rare combination of high performance (65B parameters) with efficient quantization options, making it viable to run on consumer hardware through CPU+GPU inference. It provides multiple quantization levels to balance between quality and resource usage.
Q: What are the recommended use cases?
The model is best suited for research purposes and local deployment of high-quality chatbot capabilities. It's particularly useful when you need a powerful language model that can run on limited hardware resources through quantization.