LLaMA2 70B Chat Uncensored GGML
Property | Value
---|---
Base Model | LLaMA2 70B
Format | GGML (Deprecated)
License | LLaMA2
Paper | QLoRA (arxiv:2305.14314)
Training Dataset | wizard_vicuna_70k_unfiltered
What is llama2_70b_chat_uncensored-GGML?
This is a quantized version of the uncensored LLaMA2 70B chat model, packaged in the GGML format for efficient CPU and GPU inference. The underlying model was fine-tuned with QLoRA on the unfiltered wizard_vicuna_70k_unfiltered conversation dataset to give more direct, unrestricted responses than the official LLaMA2 chat model.
Implementation Details
The model is available in multiple quantization levels (Q2_K through Q5_K_M), offering different trade-offs between file size (28.59 GB to 48.75 GB) and output quality. Inference requires the '-gqa 8' argument (the grouped-query attention setting needed for LLaMA2 70B), as shown in the sketch after the list below, and the model is supported by GGML-compatible inference frameworks including llama.cpp, text-generation-webui, and KoboldCpp.
- Multiple quantization options for different hardware capabilities
- Supports GPU acceleration with both CUDA and Metal
- Context window of 4096 tokens
- Uses a Human-Response prompt template
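
As a concrete illustration, the sketch below loads one of the quantized GGML files with the llama-cpp-python bindings and applies a Human/Response-style prompt. It is a minimal sketch, not the canonical usage: it assumes a GGML-era (pre-GGUF) release of llama-cpp-python that still exposes an n_gqa parameter mirroring llama.cpp's '-gqa 8' flag, and the file name, prompt layout, and prompt text are placeholders.

```python
# Minimal sketch: run the GGML model via llama-cpp-python (GGML-era build assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="llama2_70b_chat_uncensored.ggmlv3.q4_K_M.bin",  # placeholder file name
    n_ctx=4096,  # 4096-token context window
    n_gqa=8,     # mirrors llama.cpp's required '-gqa 8' for LLaMA2 70B GGML (assumed parameter name)
)

# Human/Response prompt template (assumed layout; check the original model card
# if generations look malformed).
prompt = "### HUMAN:\nExplain what quantization does to a language model.\n\n### RESPONSE:\n"

output = llm(prompt, max_tokens=256, stop=["### HUMAN:"])
print(output["choices"][0]["text"])
```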
Core Capabilities
- Provides straightforward, unfiltered responses
- Maintains high accuracy while reducing model size through quantization
- Supports partial GPU offloading to balance speed against available VRAM (see the sketch after this list)
- Compatible with major GGML inference frameworks
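
For the partial offloading mentioned above, a rough sketch under the same llama-cpp-python assumptions: the number of transformer layers kept on the GPU is controlled by n_gpu_layers, with the remainder running on the CPU. The layer count and file name below are placeholders to tune against your hardware.

```python
from llama_cpp import Llama

# Partial CPU+GPU split: offload some layers to the GPU (CUDA or Metal build of llama.cpp),
# keep the rest in system RAM. n_gpu_layers=0 is CPU-only; larger values need more VRAM.
llm = Llama(
    model_path="llama2_70b_chat_uncensored.ggmlv3.q2_K.bin",  # placeholder file name
    n_ctx=4096,
    n_gqa=8,
    n_gpu_layers=40,  # assumed value; raise or lower to fit your GPU memory
)
```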
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its uncensored approach to responses, providing direct answers without the safety filtering typical of the official chat fine-tune, while retaining the capabilities of the 70B-parameter architecture.
Q: What are the recommended use cases?
The model is suited to applications requiring direct, unfiltered responses, and is particularly useful for mixed CPU+GPU inference scenarios where straightforward interactions are preferred. Note, however, that the GGML format is now deprecated in favor of GGUF.