Yarn-Llama-2-70b-32k-2.4bpw-h6-exl2

Maintained By
LoneStriker


Property          Value
Base Model        LLaMA-2-70B
Context Window    32,000 tokens
License           Apache 2.0
Paper             arXiv:2309.00071

What is Yarn-Llama-2-70b-32k-2.4bpw-h6-exl2?

This model extends LLaMA-2-70B with significantly improved long-context handling. It was further pretrained for 400 steps using the YaRN context-extension method, enabling it to process up to 32,000 tokens of context - a substantial improvement over the base model's 4,096-token limit. The name suffix indicates an ExLlamaV2 (exl2) quantization at an average of 2.4 bits per weight with a 6-bit head layer, which dramatically reduces the memory needed to run the 70B model.
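As a rough sanity check on the quantization savings, the arithmetic below assumes the 2.4 bits-per-weight figure applies uniformly to all 70B parameters (it ignores the higher-precision head layer and quantization metadata, so real files are somewhat larger):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes at a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = quantized_size_gb(70e9, 16)    # full-precision baseline: 140 GB
exl2_gb = quantized_size_gb(70e9, 2.4)   # 2.4bpw quantization: 21 GB

print(f"fp16: {fp16_gb:.0f} GB, 2.4bpw: {exl2_gb:.0f} GB")
```

At roughly 21 GB of weights, the quantized model fits on a single 24 GB GPU, which is the usual motivation for such an aggressive bit width.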

Implementation Details

The model requires specific loading parameters, including Flash Attention 2 and bfloat16 precision. It was trained on the JUWELS supercomputer with support from LAION AI. Perplexity stays low across context lengths, improving from 3.61 at 1k tokens to 2.23 at 32k tokens as the model makes use of the longer context.

  • Requires trust_remote_code=True parameter
  • Utilizes Flash Attention 2 for efficient processing
  • Implements bfloat16 precision for optimal performance
  • Compatible with the latest transformers library
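The YaRN extension behind the longer context can be sketched as an "NTK-by-parts" rescaling of the RoPE inverse frequencies. The sketch below is an illustration, not the model's actual loading code; the ramp bounds (alpha=1, beta=32) and the attention-temperature formula follow the defaults described in the YaRN paper:

```python
import math

def yarn_rope_freqs(dim=128, base=10000.0, orig_ctx=4096, new_ctx=32768,
                    alpha=1.0, beta=32.0):
    """Rescale RoPE inverse frequencies with YaRN's NTK-by-parts ramp.

    Low-frequency dims (wavelength >= original context) are fully
    interpolated by the scale factor s; high-frequency dims are left
    unchanged; dims in between are blended linearly.
    """
    s = new_ctx / orig_ctx  # context scale factor: 32768 / 4096 = 8
    freqs = []
    for d in range(0, dim, 2):
        theta = base ** (-d / dim)        # original inverse frequency
        wavelength = 2 * math.pi / theta
        r = orig_ctx / wavelength         # rotations over the original context
        gamma = min(max((r - alpha) / (beta - alpha), 0.0), 1.0)
        freqs.append((1 - gamma) * theta / s + gamma * theta)
    return s, freqs

s, freqs = yarn_rope_freqs()
# YaRN additionally scales attention logits by sqrt(1/t), where
# sqrt(1/t) = 0.1 * ln(s) + 1 is the paper's recommended temperature.
temp_scale = 0.1 * math.log(s) + 1
```

The key design choice is that the highest-frequency dimensions, which encode fine-grained relative positions, are left untouched, while only the slow dimensions are stretched to cover the new range.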

Core Capabilities

  • Extended context window of 32k tokens
  • Maintained performance on standard benchmarks (ARC-c: 67.41, MMLU: 68.84)
  • Improved long-context processing with minimal quality degradation
  • Improved TruthfulQA score (46.14 vs the base model's 44.92)

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its ability to handle extremely long contexts (32k tokens) while maintaining or improving performance across various benchmarks compared to the base LLaMA-2-70B model. This makes it particularly suitable for tasks requiring extensive context processing.

Q: What are the recommended use cases?

The model is ideal for applications requiring long-form content analysis, document processing, and complex reasoning tasks that benefit from extended context windows. It's particularly well-suited for tasks like document summarization, long-form question answering, and analysis of extensive text passages.
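For documents that still exceed the 32k window, a simple token-budget calculation shows how much the larger window reduces the number of passes. This is a hypothetical helper for illustration; real code would count tokens with the model's tokenizer rather than assume a fixed total:

```python
def num_chunks(total_tokens: int, window: int, overlap: int = 0) -> int:
    """Number of sliding-window passes needed to cover total_tokens."""
    if total_tokens <= window:
        return 1
    stride = window - overlap
    # ceiling division over the tokens remaining after the first window
    return 1 + -(-(total_tokens - window) // stride)

doc = 100_000  # tokens in a long report
print(num_chunks(doc, 4_096))    # base LLaMA-2 window: 25 passes
print(num_chunks(doc, 32_000))   # this model's window: 4 passes
```

Fewer passes mean fewer chunk boundaries where cross-references can be lost, which is the practical advantage for summarization and long-form QA.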
