long_llama_3b

Maintained By
syzymon

LongLLaMA 3B

Parameter Count: 3.43B
License: Apache 2.0
Research Paper: Focused Transformer: Contrastive Training for Context Scaling
Training Data: RedPajama-Data-1T
Context Length: Up to 256k tokens

What is long_llama_3b?

LongLLaMA 3B is a language model designed to extend the context lengths that transformer architectures can handle. Built upon OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method, it can process inputs of 256,000 tokens or more, far beyond the context limits of the base model.

Implementation Details

The model implements the Focused Transformer architecture with three specific memory layers (6, 12, and 18) for context extension. It utilizes a unique contrastive training approach where memory attention layers are exposed to both relevant and irrelevant keys, enabling better semantic differentiation and context length extrapolation.
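The core idea can be illustrated with a toy example. The sketch below is a simplified, dependency-free illustration of single-query attention over an external memory, not the model's actual implementation: during FoT training, the memory mixes keys from the current document (relevant) with keys from other documents (irrelevant), and the contrastive objective teaches the memory layers to concentrate attention weight on the relevant keys.

```python
import math

def memory_attention(query, mem_keys, mem_values):
    """Toy single-query softmax attention over an external memory.

    In FoT training, mem_keys would mix keys from the same document
    (relevant) with keys from other documents (irrelevant); a memory
    layer that differentiates them puts most weight on relevant keys.
    """
    # Scaled dot-product scores against every memory key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in mem_keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the memory values.
    return [sum(w * v[i] for w, v in zip(weights, mem_values))
            for i in range(len(mem_values[0]))]

# A query aligned with the first key attends almost entirely to it.
out = memory_attention([10.0, 0.0],
                       [[1.0, 0.0], [0.0, 1.0]],   # one relevant, one irrelevant key
                       [[1.0, 0.0], [0.0, 1.0]])
```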

  • Built on OpenLLaMA 3B base model
  • Trained on 1T tokens (base) + 10B tokens (fine-tuning)
  • Supports both F32 and BF16 tensor types
  • Implements automatic context window splitting for long inputs
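The window-splitting step above can be sketched as follows. This is a simplified illustration of the idea, not LongLLaMA's actual code: long inputs are cut into fixed-size windows that the local layers process one at a time, while the memory layers attend across windows.

```python
def split_into_windows(token_ids, window_size=2048):
    """Split a long token sequence into consecutive fixed-size windows.

    Simplified sketch of automatic context-window splitting: each
    window fits the local attention span; the final window may be
    shorter than window_size.
    """
    return [token_ids[i:i + window_size]
            for i in range(0, len(token_ids), window_size)]

# A 5000-token input yields windows of 2048, 2048, and 904 tokens.
windows = split_into_windows(list(range(5000)), window_size=2048)
```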

Core Capabilities

  • Handles extremely long context lengths (up to 256k tokens)
  • Maintains performance parity with original OpenLLaMA on standard benchmarks
  • Improved performance on long-context tasks like TREC and WebQS
  • Drop-in replacement for standard LLaMA implementations
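Because the model is a drop-in replacement for LLaMA checkpoints, it can be loaded through the standard Hugging Face transformers API. The sketch below follows the usual pattern for custom-architecture checkpoints (the `trust_remote_code=True` flag pulls in the model's FoT-specific code); exact arguments may differ from the upstream model card, so treat this as an assumption to verify against the repository.

```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

def load_long_llama(model_id="syzymon/long_llama_3b"):
    """Load LongLLaMA 3B via the standard transformers API.

    Sketch only: trust_remote_code=True is needed because the FoT
    memory layers are defined in the checkpoint's own modeling code.
    """
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float32,  # F32; BF16 is also supported
        trust_remote_code=True,
    )
    return tokenizer, model
```

From there, generation works like any other causal LM: tokenize a prompt, call `model.generate`, and decode the result.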

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its ability to handle extremely long contexts through the Focused Transformer architecture, while maintaining performance on standard tasks. It achieves this without requiring training on full-length sequences.

Q: What are the recommended use cases?

The model excels in tasks requiring long context processing, such as document analysis, long-form text generation, and question-answering over extended contexts. It's particularly useful for applications needing to process or generate text beyond traditional context windows.
