guanaco-65B-GGML

TheBloke

A GGML-quantized release of the 65B-parameter Guanaco model, based on the LLaMA architecture and packaged for CPU+GPU inference with multiple quantization options.

Property               Value
Base Model             LLaMA 65B
License                Other (Apache 2 for adapters)
Paper                  QLoRA Paper
Quantization Options   2-bit to 8-bit GGML

What is guanaco-65B-GGML?

Guanaco-65B-GGML is a quantized release of the Guanaco language model, packaged in the GGML format for CPU inference with optional GPU offloading. Guanaco is based on the LLaMA architecture and was fine-tuned with QLoRA on the OASST1 dataset. The files are offered at quantization levels from 2-bit to 8-bit, so users can trade output quality against memory use and speed.

Implementation Details

The model comes in multiple quantization variants, ranging from the smallest 2-bit files (27.33GB) to the 8-bit file (69.37GB). The variants cover both the original llama.cpp quantization methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the newer k-quant methods.
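To make the file sizes above easier to compare, the following minimal sketch converts a file size into an approximate number of bits stored per weight. The parameter count of 65.2 billion and the reading of "GB" as 10^9 bytes are assumptions, not figures from the model card, so treat the results as rough.

```python
# Rough effective bits per weight for a quantized model file (illustrative only).

PARAM_COUNT = 65.2e9  # assumption: approximate parameter count of LLaMA 65B


def bits_per_weight(file_size_gb: float, params: float = PARAM_COUNT) -> float:
    """Return approximate bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / params


# The two sizes quoted in the text above.
sizes_gb = {"smallest 2-bit file": 27.33, "8-bit file": 69.37}

for name, size_gb in sizes_gb.items():
    print(f"{name}: {size_gb} GB -> ~{bits_per_weight(size_gb):.2f} bits/weight")
```

Under these assumptions the two files work out to roughly 3.4 and 8.5 bits per weight. The values exceed the nominal 2 and 8 bits because the files also store per-block scaling data, and the k-quant variants likely keep some tensors at higher precision.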

  • Supports multiple quantization levels (q2_K through q8_0)
  • Compatible with llama.cpp and various UI frameworks
  • Includes GPU layer offloading capabilities
  • Requires a specific prompt template: "### Human: [prompt] ### Assistant:"
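The prompt template and GPU offloading points in the list above can be sketched together as a small helper that formats a prompt and assembles a llama.cpp command line. This is a sketch under assumptions: the model filename is hypothetical, the thread count and token limit are arbitrary, and the flags shown (-m, -t, -ngl, -c, -n, -p) are those of the llama.cpp main example, which may differ between builds.

```python
import shlex


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the template quoted in the list above."""
    return f"### Human: {user_message.strip()} ### Assistant:"


def llama_cpp_args(model_path: str, prompt: str, gpu_layers: int = 0,
                   ctx: int = 2048, threads: int = 8) -> list[str]:
    """Assemble an argument list for the llama.cpp `main` example binary."""
    return [
        "./main",
        "-m", model_path,          # path to the quantized model file
        "-t", str(threads),        # CPU threads (arbitrary default)
        "-ngl", str(gpu_layers),   # layers to offload to the GPU; 0 = CPU only
        "-c", str(ctx),            # context size in tokens
        "-n", "256",               # max tokens to generate (arbitrary choice)
        "-p", prompt,
    ]


if __name__ == "__main__":
    # Hypothetical filename; use the file you actually downloaded.
    args = llama_cpp_args("guanaco-65B.ggmlv3.q4_0.bin",
                          build_prompt("Write a story about llamas"),
                          gpu_layers=32)
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids quoting problems when the prompt contains spaces or punctuation; the list can be passed directly to subprocess.run.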

Core Capabilities

  • Reported as competitive with commercial chatbot systems on the Vicuna and OpenAssistant benchmarks in the QLoRA paper
  • Multi-lingual response capabilities
  • Efficient CPU+GPU inference with adjustable resource usage
  • Supports a context window of 2048 tokens
  • Scores 62.2% accuracy on the MMLU benchmark
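Because the context window is limited to 2048 tokens, a chat application has to drop older turns once a conversation grows. The sketch below shows one simple way to do that, keeping the most recent turns that fit. The function names are hypothetical, the reply reserve of 256 tokens is arbitrary, and the default token counter is a crude 4-characters-per-token estimate that should be replaced with the model's real tokenizer.

```python
from typing import Callable, List

CONTEXT_WINDOW = 2048  # context size stated in the list above


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def fit_history(turns: List[str], reserve_for_reply: int = 256,
                count_tokens: Callable[[str], int] = rough_token_count) -> List[str]:
    """Keep the most recent turns whose combined size fits the prompt budget."""
    budget = CONTEXT_WINDOW - reserve_for_reply
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Reserving part of the window for the reply matters because the prompt and the generated tokens share the same 2048-token limit.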

Frequently Asked Questions

Q: What makes this model unique?

This model pairs a large parameter count (65B) with a range of quantization options, which makes it feasible to run on consumer hardware through combined CPU+GPU inference. The multiple quantization levels let users balance output quality against resource usage.
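As a rough illustration of how CPU+GPU splitting plays out, the sketch below estimates how many transformer layers of a given file might fit in GPU memory. It assumes the file size is spread evenly over the 80 layers of LLaMA 65B and that about 2 GB of VRAM is lost to overhead; both are simplifications, and the function name is hypothetical.

```python
N_LAYERS = 80  # transformer layers in LLaMA 65B


def layers_that_fit(file_size_gb: float, vram_gb: float,
                    overhead_gb: float = 2.0) -> int:
    """Estimate how many layers can be offloaded to a GPU with `vram_gb` of memory.

    Assumes every layer takes an equal share of the file, which ignores
    embeddings, the output layer, and context-dependent buffers.
    """
    per_layer_gb = file_size_gb / N_LAYERS
    usable_gb = max(0.0, vram_gb - overhead_gb)
    return min(N_LAYERS, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # File sizes quoted earlier in this page; 24 GB is an example GPU size.
    for label, size_gb in [("smallest 2-bit file", 27.33), ("8-bit file", 69.37)]:
        print(f"{label}: ~{layers_that_fit(size_gb, 24.0)} of {N_LAYERS} layers on a 24 GB GPU")
```

The result is only a starting point for a GPU-layers setting; actual memory use depends on the context size and the build, so it is safer to begin lower and increase until memory runs out.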

Q: What are the recommended use cases?

The model is best suited to research and to local deployment of a high-quality chatbot. It is particularly useful when a large language model has to run on hardware with limited memory, which the quantized files make possible.
