TinyLlama-1.1B-intermediate-step-1431k-3T

A compact 1.1B-parameter LLaMA-based model pretrained on 3 trillion tokens, reaching 60.31% normalized accuracy (10-shot) on the HellaSwag benchmark.

Property          Value
Parameter Count   1.1B
License           Apache 2.0
Training Tokens   3 trillion
Architecture      LLaMA-based Transformer

What is TinyLlama-1.1B-intermediate-step-1431k-3T?

TinyLlama-1.1B is an ambitious project aimed at creating a compact yet capable language model by pretraining a 1.1B-parameter model on 3 trillion tokens. This checkpoint (step 1431k) is the final one of the 3-trillion-token run, delivering strong benchmark scores while keeping a small computational footprint.

Implementation Details

The model adopts the same architecture and tokenizer as Llama 2, making it highly compatible with existing Llama-based projects. It was trained using 16 A100-40G GPUs over a 90-day period, demonstrating efficient resource utilization for large-scale training.

  • Identical architecture to Llama 2 for seamless integration
  • Trained on SlimPajama-627B and StarCoder datasets
  • Optimized for both performance and memory efficiency
  • Weights published in F32 tensor format
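As a back-of-the-envelope check on the 1.1B figure, the sketch below tallies parameters from the commonly published TinyLlama config (the specific values — hidden size 2048, 22 layers, 32 query heads with 4 KV heads via grouped-query attention, MLP intermediate size 5632, vocabulary 32000, untied embeddings — are assumptions drawn from the project's config, not stated on this card):

```python
# Back-of-the-envelope parameter count for TinyLlama-1.1B.
# Assumed config: hidden=2048, 22 layers, 32 query heads / 4 KV heads (GQA),
# MLP intermediate size 5632, vocab 32000, untied input/output embeddings.
hidden, layers, vocab = 2048, 22, 32000
heads, kv_heads = 32, 4
head_dim = hidden // heads          # 64
inter = 5632

kv_dim = kv_heads * head_dim        # 256 -- GQA shrinks the K/V projections
attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q/O plus K/V projections
mlp = 3 * hidden * inter                           # gate, up, down (SwiGLU)
norms = 2 * hidden                                 # two RMSNorms per layer

embed = 2 * vocab * hidden                         # input embeddings + LM head
total = layers * (attn + mlp + norms) + embed + hidden  # + final RMSNorm

print(f"{total:,} parameters (~{total / 1e9:.2f}B)")   # ~1.10B
```

The count lands at roughly 1.10 billion, consistent with the card's 1.1B figure; most of the budget sits in the MLP blocks, and grouped-query attention keeps the K/V projections small.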

Core Capabilities

  • HellaSwag (10-shot): 60.31% normalized accuracy
  • Winogrande (5-shot): 59.51% accuracy
  • TruthfulQA (0-shot): 37.32% accuracy
  • MMLU (5-shot): 26.04% accuracy
  • Efficient performance in resource-constrained environments

Frequently Asked Questions

Q: What makes this model unique?

TinyLlama stands out for achieving impressive performance metrics with only 1.1B parameters, making it significantly more efficient than larger models while maintaining strong capabilities. Its compatibility with the Llama ecosystem makes it particularly valuable for resource-constrained applications.

Q: What are the recommended use cases?

The model is ideal for applications requiring a balance between performance and computational efficiency, such as edge devices, rapid prototyping, and scenarios where larger models would be impractical. It's particularly well-suited for text generation tasks where resource constraints are a primary concern.
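Because the model shares Llama 2's architecture and tokenizer, it loads through the standard Hugging Face `transformers` text-generation pipeline. A minimal sketch, assuming `transformers` and `torch` are installed (the model id matches this checkpoint's Hugging Face repo; the sampling parameters are illustrative, not recommendations from the card):

```python
# Minimal text-generation sketch for this checkpoint.
# Assumes the `transformers` and `torch` packages are installed.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Imports kept local so this module loads even without torch installed.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=torch.float32,  # card lists F32 tensors; fp16 also works on GPU
        device_map="auto",
    )
    out = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=50)
    return out[0]["generated_text"]

if __name__ == "__main__":
    print(generate("The TinyLlama project aims to"))
```

Note this is a base (pretrained) checkpoint, so it continues text rather than following instructions; for chat-style use, the separately released chat fine-tunes are the better fit.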
