Llama2 7B Chat Uncensored GGML

Property	Value
Base Model	LLaMA 2 7B
Model Type	Chat/Conversational
License	Other
Format	GGML (Various quantizations)

What is llama2_7b_chat_uncensored-GGML?

This is a quantized GGML conversion of George Sung's uncensored LLaMA 2 7B chat model, intended for CPU and GPU inference. The original model was fine-tuned from LLaMA 2 7B on the wizard_vicuna_70k_unfiltered dataset using QLoRA, yielding an uncensored chat variant of the base model.

Implementation Details

The model is available in multiple quantization levels ranging from 2-bit to 8-bit, each trading off file size, memory usage, inference speed, and output quality. For example, the q4_K_M variant uses 4-bit k-quant quantization and requires about 4.08 GB of storage and about 6.58 GB of RAM during inference.
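The two q4_K_M figures above differ by about 2.5 GB, which suggests a simple rule of thumb: RAM needed is roughly file size plus a fixed overhead. The sketch below applies that rule. Treating the overhead as constant across quantization levels is an assumption, and the function names are illustrative, so measure on your own hardware before relying on the result.

```python
# Figures quoted above for the q4_K_M file.
Q4_K_M_FILE_GB = 4.08
Q4_K_M_RAM_GB = 6.58

# Assumption: the gap between the two figures is a fixed runtime overhead
# that also applies to the other quantization levels.
ASSUMED_OVERHEAD_GB = round(Q4_K_M_RAM_GB - Q4_K_M_FILE_GB, 2)


def estimate_ram_gb(file_size_gb: float,
                    overhead_gb: float = ASSUMED_OVERHEAD_GB) -> float:
    """Rough RAM estimate: model file size plus a fixed overhead."""
    return round(file_size_gb + overhead_gb, 2)


def fits_in_ram(file_size_gb: float, available_ram_gb: float) -> bool:
    """Return True if the estimated RAM need is within the available RAM."""
    return estimate_ram_gb(file_size_gb) <= available_ram_gb
```

For instance, `fits_in_ram(4.08, 8.0)` checks whether the q4_K_M file would be expected to fit on a machine with 8 GB of free RAM under this rule of thumb.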

Multiple quantization options (2-bit to 8-bit)
Supports both CPU and GPU inference
Uses the Human-Response prompt template
Compatible with various GGML-supporting frameworks
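As a minimal sketch of the Human-Response prompt template, the helper below wraps a user message before it is passed to whichever GGML runtime is in use. It assumes the template takes the form of a "### HUMAN:" block followed by a "### RESPONSE:" block; the exact marker strings and the helper names are assumptions, so confirm them against the model's own card.

```python
# Assumed marker strings for the Human-Response template.
HUMAN_MARKER = "### HUMAN:"
RESPONSE_MARKER = "### RESPONSE:"


def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the assumed Human-Response template."""
    return f"{HUMAN_MARKER}\n{user_message.strip()}\n\n{RESPONSE_MARKER}\n"


def extract_response(generated_text: str) -> str:
    """Keep only the reply, cutting at the next human marker if the model
    continues the conversation on its own."""
    reply = generated_text.split(HUMAN_MARKER, 1)[0]
    return reply.strip()
```

The string returned by `build_prompt` is what would be sent to the model; `extract_response` is an optional cleanup step for runtimes that do not support stop sequences.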

Core Capabilities

Uncensored chat responses
Context window support up to 4096 tokens
Efficient inference on consumer hardware
Flexible deployment options across different platforms
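Because the context window is limited to 4096 tokens, a long chat history has to be trimmed before each request. The sketch below keeps the most recent turns that fit within the window while reserving room for the reply. The token counter is a crude stand-in (about four characters per token, an assumption), and the function names and the 512-token reserve are illustrative; a real deployment would count tokens with the runtime's own tokenizer.

```python
from typing import Callable, List

CONTEXT_TOKENS = 4096  # context window stated above


def rough_token_count(text: str) -> int:
    """Crude estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def trim_history(turns: List[str],
                 reserve_for_reply: int = 512,
                 count_tokens: Callable[[str], int] = rough_token_count) -> List[str]:
    """Keep the newest turns whose combined size fits the context budget."""
    budget = CONTEXT_TOKENS - reserve_for_reply
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Passing a real tokenizer's counting function as `count_tokens` replaces the rough estimate without changing the trimming logic.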

Frequently Asked Questions

Q: What makes this model unique?

This model pairs the LLaMA 2 7B architecture with fine-tuning on an unfiltered dataset, and it ships as quantized GGML files suited to local deployment. The range of quantization levels lets users trade file size and memory use against output quality for their specific use case.

Q: What are the recommended use cases?

The model is suited to applications that need unrestricted conversation while operating under hardware constraints. The multiple quantization options allow deployment across different hardware configurations: lower-bit files for resource-constrained environments and higher-bit files for systems with more memory.