Falcon Mamba 7B
| Property | Value |
|---|---|
| Parameter Count | 7.27B |
| Architecture Type | Mamba (attention-free) |
| Training Data | ~5,500B tokens |
| License | TII Falcon-Mamba License 2.0 |
| Paper | arXiv:2410.05355 |
What is falcon-mamba-7b?
Falcon-Mamba-7B, developed by TII (Technology Innovation Institute), is the first attention-free 7B-parameter language model to compete with strong transformer baselines. It uses the Mamba state space architecture to reach performance comparable to traditional transformer-based models of similar size, while potentially offering better efficiency and scalability on long sequences.
Implementation Details
The model has 64 layers with a hidden dimension of 4096 and a state dimension of 16. It was trained on approximately 5,500B tokens, drawn primarily from RefinedWeb, using 256 H100 80GB GPUs. Training followed a multi-stage strategy with curriculum learning, increasing the context length from 2,048 to 8,192 tokens (a configuration sketch follows the list below).
- Trained using bfloat16 precision
- Uses AdamW optimizer with WSD learning rate schedule
- Maximum learning rate of 6.4e-4
- Batch size of 2048
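The sketch below is not part of the official documentation; it simply shows how the architecture hyperparameters described above surface in the Hugging Face transformers config for the public tiiuae/falcon-mamba-7b checkpoint. It assumes transformers v4.44 or newer (which added FalconMamba support), and the exact attribute names may vary across versions.

```python
# Hedged sketch: inspect the published config and compare it with the
# figures quoted in this card (64 layers, hidden dim 4096, state dim 16).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b")

print(config.model_type)         # expected: "falcon_mamba"
print(config.num_hidden_layers)  # expected: 64
print(config.hidden_size)        # expected: 4096
print(config.state_size)         # expected: 16
```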
Core Capabilities
- Strong instruction-following performance (33.36% on IFEval)
- Mathematical problem solving (3.63% on MATH Level 5)
- Efficient text generation without attention mechanisms
- Support for multiple precision formats such as FP16 and 4-bit quantization (see the loading sketch below)
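As a rough illustration of the precision options listed above, the snippet below loads the public tiiuae/falcon-mamba-7b checkpoint in FP16 and in 4-bit via bitsandbytes. It assumes transformers, accelerate, and bitsandbytes are installed, and is a sketch rather than an official recipe; in practice you would pick one of the two options.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-mamba-7b"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option 1: half precision (FP16) weights
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```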
Frequently Asked Questions
Q: What makes this model unique?
It is the first competitive language model of its size to achieve strong performance without using attention mechanisms. Its recurrent state space design keeps per-token memory constant with respect to sequence length, potentially offering better scaling properties and efficiency than traditional transformer models.
Q: What are the recommended use cases?
The model excels in general text generation tasks, reasoning, and mathematical problem-solving. It's particularly suitable for applications requiring efficient processing of long sequences due to its attention-free architecture.
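A minimal usage sketch for plain text generation with the public checkpoint is shown below; the prompt and generation settings are illustrative only, and the FP16 loading mirrors the earlier snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative prompt; this is the base (non-instruct) checkpoint,
# so continuation-style prompts tend to work best.
prompt = "State space language models differ from transformers in that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```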