Llama-2-70B-GGML

Maintained by TheBloke


Property         Value
Base Model       Llama 2 70B
License          Meta Custom License
Paper            Llama 2 Paper
Context Length   4K tokens
Training Data    2T tokens

What is Llama-2-70B-GGML?

Llama-2-70B-GGML is a GGML-optimized version of Meta's 70B parameter language model, converted specifically for efficient CPU inference. This implementation offers quantization options ranging from 2-bit to 8-bit precision, letting users trade off model size, output quality, and resource requirements.
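
The spread between the smallest and largest variants follows almost directly from arithmetic on effective bits per weight. The sketch below is a rough estimate only: the bits-per-weight figures are approximations that include per-block scale overhead, and the exact file sizes differ because quantization formats mix precisions across tensors.

```python
# Back-of-envelope file-size estimate: parameters * effective bits per weight / 8.
# Effective bits include per-block scale overhead (e.g. q4_0 stores 4-bit
# weights plus an fp16 scale per 32-weight block: 4 + 16/32 = 4.5 bits).
PARAMS = 70e9  # Llama 2 70B parameter count

def approx_size_gb(bits_per_weight: float) -> float:
    """Estimated file size in decimal gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("q4_0", 4.5), ("q8_0", 8.5)]:
    print(f"{name}: ~{approx_size_gb(bits):.1f} GB")
# q4_0 -> ~39.4 GB, q8_0 -> ~74.4 GB, close to the published file sizes.
# k-quant variants such as q2_K land above the naive estimate because they
# keep sensitive tensors at higher precision than their nominal bit width.
```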

Implementation Details

The model comes in multiple quantization variants, from the lightweight q2_K (28.59GB) to the high-precision q8_0 (73.23GB). It requires llama.cpp commit e76d630 or later, and because Llama 2 70B uses Grouped-Query Attention (GQA), inference must be run with the -gqa 8 parameter (see the loading sketch after the list below).

  • Multiple quantization options (q2_K to q8_0)
  • Supports CPU inference with llama.cpp
  • Requires 31GB to 75GB RAM depending on quantization
  • Implements advanced k-quant methods for optimal performance
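
For reference, here is a minimal loading sketch using llama-cpp-python, a common Python binding for llama.cpp. GGML files are only handled by older releases of the binding (before the switch to the GGUF format), so the parameter names below are version-specific assumptions, and the model filename is illustrative.

```python
from llama_cpp import Llama

# Load a quantized GGML file for CPU inference. The filename is an
# assumed example; pick the quantization variant that fits your RAM.
llm = Llama(
    model_path="llama-2-70b.ggmlv3.q4_K_M.bin",  # hypothetical local file
    n_ctx=4096,   # full 4K-token context window
    n_gqa=8,      # Grouped-Query Attention setting required for the 70B model
    n_threads=8,  # tune to your CPU core count
)
```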

Core Capabilities

  • Strong performance across various benchmarks (68.9% on MMLU)
  • Efficient CPU inference with multiple precision options
  • 4K token context window
  • Supports both general text generation and completion tasks
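
Continuing the loading sketch above, a completion call might look like the following. The OpenAI-style call interface shown is how llama-cpp-python exposes text completion; since the base 70B model is not instruction-tuned, prompts work best when phrased as text to be continued.

```python
from llama_cpp import Llama

# Same version constraints and assumed filename as the loading sketch above.
llm = Llama(model_path="llama-2-70b.ggmlv3.q4_K_M.bin", n_ctx=4096, n_gqa=8)

# Plain text completion: the model continues the prompt.
output = llm(
    "The key trade-off between 4-bit and 8-bit quantization is",
    max_tokens=128,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```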

Frequently Asked Questions

Q: What makes this model unique?

This implementation stands out for making Meta's 70B parameter model accessible on consumer hardware: the GGML format enables CPU-only inference, and the range of quantization options lets users match the model to their available RAM.

Q: What are the recommended use cases?

The model is best suited for general text generation tasks, research applications, and development of AI applications where CPU deployment is preferred over GPU. It is particularly useful when model quality must be balanced against resource constraints.
