LLaMA2 70B Chat Uncensored GGML
Property | Value
---|---
Base Model | LLaMA2 70B
Format | GGML (Deprecated)
License | LLaMA2
Paper | QLoRA (arxiv:2305.14314)
Training Dataset | wizard_vicuna_70k_unfiltered
What is llama2_70b_chat_uncensored-GGML?
This is a quantized version of the uncensored LLaMA2 70B chat model, packaged in the GGML format for efficient CPU and GPU inference. The underlying model was fine-tuned with QLoRA on the unfiltered wizard_vicuna_70k_unfiltered conversation dataset to give more direct, unrestricted responses than the official LLaMA2 chat model.
Implementation Details
The model is available in multiple quantization levels (Q2_K through Q5_K_M), offering different trade-offs between file size (28.59 GB to 48.75 GB) and output quality. Inference requires the '-gqa 8' argument (the grouped-query attention setting needed for LLaMA2 70B), as shown in the sketch after the list below, and the model is supported by GGML-compatible inference frameworks including llama.cpp, text-generation-webui, and KoboldCpp.
- Multiple quantization options for different hardware capabilities
- Supports GPU acceleration with both CUDA and Metal
- Context window of 4096 tokens
- Uses a Human-Response prompt template
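
As a concrete illustration, the sketch below loads one of the quantized GGML files with the llama-cpp-python bindings and applies a Human/Response-style prompt. It is a minimal sketch, not the canonical usage: it assumes a GGML-era (pre-GGUF) release of llama-cpp-python that still exposes an n_gqa parameter mirroring llama.cpp's '-gqa 8' flag, and the file name, prompt layout, and prompt text are placeholders.

```python
# Minimal sketch: run the GGML model via llama-cpp-python (GGML-era build assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="llama2_70b_chat_uncensored.ggmlv3.q4_K_M.bin",  # placeholder file name
    n_ctx=4096,  # 4096-token context window
    n_gqa=8,     # mirrors llama.cpp's required '-gqa 8' for LLaMA2 70B GGML (assumed parameter name)
)

# Human/Response prompt template (assumed layout; check the original model card
# if generations look malformed).
prompt = "### HUMAN:\nExplain what quantization does to a language model.\n\n### RESPONSE:\n"

output = llm(prompt, max_tokens=256, stop=["### HUMAN:"])
print(output["choices"][0]["text"])
```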
Core Capabilities
- Provides straightforward, unfiltered responses
- Maintains high accuracy while reducing model size through quantization
- Supports partial GPU offloading to balance speed against available VRAM (see the sketch after this list)
- Compatible with major GGML inference frameworks
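
For the partial offloading mentioned above, a rough sketch under the same llama-cpp-python assumptions: the number of transformer layers kept on the GPU is controlled by n_gpu_layers, with the remainder running on the CPU. The layer count and file name below are placeholders to tune against your hardware.

```python
from llama_cpp import Llama

# Partial CPU+GPU split: offload some layers to the GPU (CUDA or Metal build of llama.cpp),
# keep the rest in system RAM. n_gpu_layers=0 is CPU-only; larger values need more VRAM.
llm = Llama(
    model_path="llama2_70b_chat_uncensored.ggmlv3.q2_K.bin",  # placeholder file name
    n_ctx=4096,
    n_gqa=8,
    n_gpu_layers=40,  # assumed value; raise or lower to fit your GPU memory
)
```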
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its uncensored approach to responses, providing direct answers without the safety filtering typical of the official chat fine-tune, while retaining the capabilities of the 70B-parameter architecture.
Q: What are the recommended use cases?
The model is suited to applications requiring direct, unfiltered responses, and is particularly useful for mixed CPU+GPU inference scenarios where straightforward interactions are preferred. Note, however, that the GGML format is now deprecated in favor of GGUF.