Llama-2-13B-chat-GGML
| Property | Value |
|---|---|
| Parameter Count | 13 Billion |
| Model Type | Chat-optimized Language Model |
| Architecture | Llama 2 |
| License | Meta Custom License |
| Research Paper | Llama 2 Paper |
What is Llama-2-13B-chat-GGML?
Llama-2-13B-chat-GGML is Meta's Llama 2 13B chat model converted to the GGML format for efficient CPU inference with optional GPU offloading. It sits in the middle of the Llama 2 family, between the 7B and 70B variants, balancing output quality against resource requirements. The model is fine-tuned specifically for dialogue applications and ships in multiple quantization levels to suit different hardware configurations.
Implementation Details
The model is available in quantization levels from 2-bit to 8-bit, letting users trade file size against output quality. For example, the q4_K_M variant offers a good compromise: a 7.87 GB file with a roughly 10.37 GB RAM requirement. The quantized files use llama.cpp's newer k-quant methods.
- Context length: 4096 tokens standard (expandable with RoPE scaling)
- Multiple quantization options (q2_K through q8_0)
- Supports GPU layer offloading for improved performance
- Compatible with llama.cpp and various UI implementations (see the sketch below)
Core Capabilities
- Optimized for dialogue and chat applications
- Strong performance in helpfulness and safety benchmarks
- Scores 54.8 on MMLU (13B version)
- Scores 62.18% on TruthfulQA, reflecting improved truthfulness
- Reports a 0.00 toxicity score in Meta's safety evaluations
Frequently Asked Questions
Q: What makes this model unique?
This GGML version enables efficient CPU/GPU inference with a range of quantization options, making the model practical on consumer hardware while preserving the quality of the original Llama 2 weights.
Q: What are the recommended use cases?
The model excels in assistant-like chat applications, text generation, and general dialogue tasks. It's particularly suitable for deployment in scenarios where a balance between performance and resource usage is crucial.