Llama-3.1-Minitron-4B-Width-Base
| Property | Value |
|---|---|
| Parameter Count | 4.51B |
| Model Type | Transformer Decoder (Auto-Regressive) |
| Architecture | Llama-3.1 with GQA and RoPE |
| License | NVIDIA Open Model License |
| Research Paper | Technical Report |
| Training Period | July 29, 2024 - Aug 3, 2024 |
What is Llama-3.1-Minitron-4B-Width-Base?
Llama-3.1-Minitron-4B-Width-Base is a language model that NVIDIA derived from the larger Llama-3.1-8B model through structured width pruning. It preserves much of the parent model's performance while reducing computational requirements by shrinking the embedding (hidden) dimension and the MLP intermediate dimension.
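As a rough illustration of what width pruning involves (a minimal sketch under simplifying assumptions, not NVIDIA's actual procedure; Llama's MLP is gated and uses SiLU rather than the plain ReLU shown here), one common approach scores each MLP intermediate neuron by its average activation magnitude on calibration data and keeps only the highest-scoring channels:

```python
# Hypothetical sketch of activation-based width pruning for one MLP block.
# Names, the ReLU nonlinearity, and the selection rule are illustrative only.
import torch
import torch.nn as nn

def prune_mlp_width(up_proj: nn.Linear, down_proj: nn.Linear,
                    calib_inputs: torch.Tensor, keep: int):
    """Keep the `keep` intermediate neurons with the largest mean |activation|."""
    with torch.no_grad():
        acts = torch.relu(up_proj(calib_inputs))         # [num_tokens, intermediate]
        importance = acts.abs().mean(dim=0)               # one score per neuron
        idx = importance.topk(keep).indices.sort().values # indices of kept channels

        new_up = nn.Linear(up_proj.in_features, keep, bias=False)
        new_down = nn.Linear(keep, down_proj.out_features, bias=False)
        new_up.weight.copy_(up_proj.weight[idx, :])        # select output rows
        new_down.weight.copy_(down_proj.weight[:, idx])    # select input columns
    return new_up, new_down
```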
Implementation Details
The model uses 32 transformer layers with a 3072-dimensional embedding (hidden) size, 32 attention heads, and a 9216-dimensional MLP intermediate layer. It employs Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
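These dimensions correspond roughly to the Hugging Face `LlamaConfig` sketched below; the GQA key/value head count and RoPE base are assumptions typical of Llama-3.1, not values stated in this card:

```python
from transformers import LlamaConfig

# Approximate configuration implied by the description above.
config = LlamaConfig(
    hidden_size=3072,             # embedding (hidden) dimension
    intermediate_size=9216,       # MLP intermediate dimension
    num_hidden_layers=32,         # transformer layers
    num_attention_heads=32,       # query heads
    num_key_value_heads=8,        # GQA key/value heads (assumed, not stated above)
    max_position_embeddings=8192,
    rope_theta=500000.0,          # Llama-3.1-style RoPE base (assumed)
)
```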
- BFloat16 precision for a good performance-efficiency balance (see the loading sketch after this list)
- Supports input lengths up to 8k characters
- Trained on 94 billion tokens via knowledge distillation from the Llama-3.1-8B teacher
- Compatible with NVIDIA Ampere, Blackwell, Hopper, and Lovelace architectures
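A minimal loading and generation sketch using the `transformers` library is shown below; the repository id is assumed to match the model's published name, so adjust it to the checkpoint you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id; substitute your own checkpoint path if needed.
model_id = "nvidia/Llama-3.1-Minitron-4B-Width-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the BF16 precision noted above
    device_map="auto",
)

prompt = "The key advantage of structured pruning is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```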
Core Capabilities
- Achieves 60.5 on Massive Multitask Language Understanding (5-shot)
- Strong zero-shot performance: 76.1 on HellaSwag, 73.5 on Winogrande
- 41.2 score on GSM8K for mathematical reasoning (a few-shot prompting sketch follows this list)
- 32.0 score on MBPP for code generation tasks
- Multilingual support with emphasis on English content
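Scores such as the 5-shot MMLU and GSM8K results above come from few-shot prompting. The sketch below (reusing `model` and `tokenizer` from the loading example) shows the general shape of such a prompt; the exemplars are invented for illustration, and evaluation harnesses build these prompts automatically:

```python
# Invented exemplars in the style of a GSM8K few-shot prompt.
few_shot_examples = [
    ("If a pen costs 3 dollars, how much do 4 pens cost?", "4 * 3 = 12. The answer is 12."),
    ("Tom has 10 apples and gives away 4. How many remain?", "10 - 4 = 6. The answer is 6."),
]
question = "A train travels 60 miles per hour for 3 hours. How far does it go?"

prompt = ""
for q, a in few_shot_examples:
    prompt += f"Question: {q}\nAnswer: {a}\n\n"
prompt += f"Question: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```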
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its efficient architecture achieved through systematic width pruning and knowledge distillation, making it particularly suitable for commercial applications while maintaining strong performance metrics.
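The distillation step is commonly implemented as a KL-divergence loss between teacher and student logits. The snippet below is a generic sketch of that idea, not the exact Minitron training recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL(teacher || student) over token distributions.

    Generic logit-distillation sketch; the actual recipe may combine several
    losses and use a different temperature schedule.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scaling by t^2 is the conventional correction for temperature-softened logits.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```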
Q: What are the recommended use cases?
The model excels in various natural language generation tasks, particularly within commercial environments. It's optimized for text generation, comprehension, and code-related tasks, with effective performance in both zero-shot and few-shot scenarios.
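Because this is a base (non-instruct) model, code-related prompting works best as plain completion. A simple zero-shot example, reusing `model` and `tokenizer` from the loading sketch above (the function and docstring are purely illustrative):

```python
# Provide a function signature and docstring; the model completes the body.
prompt = (
    "def is_palindrome(s: str) -> bool:\n"
    '    """Return True if s reads the same forwards and backwards."""\n'
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```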