Guanaco-65B-GGML

Property	Value
Base Model	LLaMA 65B
License	Other (Apache 2 for adapters)
Paper	QLoRA: Efficient Finetuning of Quantized LLMs
Quantization Options	2-bit to 8-bit GGML

What is Guanaco-65B-GGML?

Guanaco-65B-GGML is a set of GGML-format quantized files for the Guanaco 65B language model, intended for CPU inference with optional GPU offloading. Guanaco is based on the LLaMA architecture and was fine-tuned with QLoRA on the OASST1 dataset. The files are offered at quantization levels from 2-bit to 8-bit, so you can trade output quality against memory use.

Implementation Details

The model is available in several quantization variants, from the smallest 2-bit file (27.33 GB) to the largest 8-bit file (69.37 GB). The variants cover both the original llama.cpp quantization methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods.
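The two file sizes quoted above imply an effective number of bits per weight, which is a rough way to compare variants. The sketch below is only an illustration: it assumes a parameter count of about 65.2 billion and that "GB" means 10^9 bytes, neither of which is stated in this card, and the function name is hypothetical.

```python
# Assumptions (not stated in the model card): about 65.2 billion parameters,
# and "GB" meaning 10**9 bytes.
ASSUMED_PARAMS = 65.2e9


def bits_per_weight(file_size_gb: float, n_params: float = ASSUMED_PARAMS) -> float:
    """Return the average number of bits stored per parameter."""
    return file_size_gb * 1e9 * 8 / n_params


# File sizes taken from the text above.
for name, size_gb in [("smallest 2-bit file", 27.33), ("largest 8-bit file", 69.37)]:
    print(f"{name}: {bits_per_weight(size_gb):.2f} bits per weight")
```

Under these assumptions both figures come out above the nominal bit width, which would be consistent with quantized formats storing per-block scale values alongside the weights.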

Supports multiple quantization levels (q2_K through q8_0)
Compatible with llama.cpp and various UI frameworks
Includes GPU layer offloading capabilities
Requires a specific prompt template: "### Human: [prompt] ### Assistant:"
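The prompt template in the last item can be applied with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the single space between the two markers follows the template exactly as written above (other whitespace, such as a newline, is not confirmed by this card).

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the template given in the model card."""
    # Strip surrounding whitespace so the markers stay single-spaced.
    return f"### Human: {user_message.strip()} ### Assistant:"


print(build_prompt("Explain what quantization does to a model."))
```

The returned string is what would be passed to the inference tool as the prompt; the model's reply is expected to follow the final "### Assistant:" marker.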

Core Capabilities

Performance reported as competitive with commercial chatbot systems
Multilingual responses
Efficient CPU+GPU inference with adjustable resource usage
Supports a context window of 2048 tokens
Scores 62.2% accuracy on the MMLU benchmark
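For the CPU+GPU split mentioned above, a rough way to pick the number of layers to offload is to divide the file size evenly across the model's layers and see how many fit in video memory. The sketch below is an approximation under stated assumptions: 80 transformer layers for LLaMA 65B, layers of roughly equal size, and a hypothetical 2 GB reserve for the context cache and scratch buffers. The function name and the reserve value are illustrative, not taken from this card.

```python
def layers_that_fit(
    file_size_gb: float,
    vram_gb: float,
    n_layers: int = 80,       # assumed layer count for LLaMA 65B
    reserve_gb: float = 2.0,  # hypothetical headroom for cache and buffers
) -> int:
    """Estimate how many layers could be offloaded to the GPU."""
    per_layer_gb = file_size_gb / n_layers      # assumes equal-sized layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # never go below zero
    return min(n_layers, int(usable_gb // per_layer_gb))


# Smallest file from the text above (27.33 GB) on a 24 GB card.
print(layers_that_fit(27.33, 24.0))
```

The result is a starting point for llama.cpp's `--n-gpu-layers` option; actual memory use depends on the variant and context length, so the value may need lowering if loading fails.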

Frequently Asked Questions

Q: What makes this model unique?

It combines a large model (65B parameters) with a range of quantization options, which makes it feasible to run on consumer hardware using combined CPU and GPU inference. The multiple quantization levels let you trade output quality against resource usage.

Q: What are the recommended use cases?

The model is best suited to research and to running a capable chatbot locally. It is particularly useful when you need a large language model on hardware with limited memory, which quantization makes possible.
