Llama-2-13B-chat-GGML
| Property | Value |
|---|---|
| Parameter Count | 13 Billion |
| Model Type | Chat-optimized Language Model |
| Architecture | Llama 2 |
| License | Meta Custom License |
| Research Paper | Llama 2 Paper |
What is Llama-2-13B-chat-GGML?
Llama-2-13B-chat-GGML is Meta's Llama 2 13B chat model converted to the GGML format for efficient CPU inference with optional GPU offloading. It sits in the middle of the Llama 2 family, between the 7B and 70B variants, balancing output quality against resource requirements. The model is fine-tuned specifically for dialogue applications and ships in multiple quantization levels to suit different hardware configurations.
Implementation Details
The model is available in quantization levels from 2-bit to 8-bit, letting users trade file size against output quality. For example, the q4_K_M variant offers a good compromise: a 7.87 GB file with a roughly 10.37 GB RAM requirement. The quantized files use llama.cpp's newer k-quant methods.
- Context length: 4096 tokens standard (expandable with RoPE scaling)
- Multiple quantization options (q2_K through q8_0)
- Supports GPU layer offloading for improved performance
- Compatible with llama.cpp and various UI implementations (see the sketch below)
Core Capabilities
- Optimized for dialogue and chat applications
- Strong performance in helpfulness and safety benchmarks
- Scores 54.8 on MMLU (13B version)
- Scores 62.18% on TruthfulQA, reflecting improved truthfulness
- Reports a 0.00 toxicity score in Meta's safety evaluations
Frequently Asked Questions
Q: What makes this model unique?
This GGML version enables efficient CPU/GPU inference with a range of quantization options, making the model practical on consumer hardware while preserving the quality of the original Llama 2 weights.
Q: What are the recommended use cases?
The model excels in assistant-like chat applications, text generation, and general dialogue tasks. It's particularly suitable for deployment in scenarios where a balance between performance and resource usage is crucial.