Falcon Mamba 7B
| Property | Value |
|---|---|
| Parameter Count | 7.27B |
| Architecture Type | Mamba (attention-free) |
| Training Data | ~5,500B tokens |
| License | TII Falcon-Mamba License 2.0 |
| Paper | arXiv:2410.05355 |
What is falcon-mamba-7b?
Falcon-Mamba-7B, developed by TII (Technology Innovation Institute), is the first attention-free 7B-parameter language model to compete with strong transformer baselines. It uses the Mamba state space architecture to reach performance comparable to traditional transformer-based models of similar size, while potentially offering better efficiency and scalability on long sequences.
Implementation Details
The model has 64 layers with a hidden dimension of 4096 and a state dimension of 16. It was trained on approximately 5,500B tokens, drawn primarily from RefinedWeb, using 256 H100 80GB GPUs. Training followed a multi-stage strategy with curriculum learning, increasing the context length from 2,048 to 8,192 tokens (a configuration sketch follows the list below).
- Trained using bfloat16 precision
- Uses AdamW optimizer with WSD learning rate schedule
- Maximum learning rate of 6.4e-4
- Batch size of 2048
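The sketch below is not part of the official documentation; it simply shows how the architecture hyperparameters described above surface in the Hugging Face transformers config for the public tiiuae/falcon-mamba-7b checkpoint. It assumes transformers v4.44 or newer (which added FalconMamba support), and the exact attribute names may vary across versions.

```python
# Hedged sketch: inspect the published config and compare it with the
# figures quoted in this card (64 layers, hidden dim 4096, state dim 16).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b")

print(config.model_type)         # expected: "falcon_mamba"
print(config.num_hidden_layers)  # expected: 64
print(config.hidden_size)        # expected: 4096
print(config.state_size)         # expected: 16
```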
Core Capabilities
- Strong instruction-following performance (33.36% on IFEval)
- Mathematical problem solving (3.63% on MATH Level 5)
- Efficient text generation without attention mechanisms
- Support for multiple precision formats such as FP16 and 4-bit quantization (see the loading sketch below)
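As a rough illustration of the precision options listed above, the snippet below loads the public tiiuae/falcon-mamba-7b checkpoint in FP16 and in 4-bit via bitsandbytes. It assumes transformers, accelerate, and bitsandbytes are installed, and is a sketch rather than an official recipe; in practice you would pick one of the two options.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-mamba-7b"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option 1: half precision (FP16) weights
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```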
Frequently Asked Questions
Q: What makes this model unique?
It is the first competitive language model of its size to achieve strong performance without using attention mechanisms. Its recurrent state space design keeps per-token memory constant with respect to sequence length, potentially offering better scaling properties and efficiency than traditional transformer models.
Q: What are the recommended use cases?
The model excels in general text generation tasks, reasoning, and mathematical problem-solving. It's particularly suitable for applications requiring efficient processing of long sequences due to its attention-free architecture.
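A minimal usage sketch for plain text generation with the public checkpoint is shown below; the prompt and generation settings are illustrative only, and the FP16 loading mirrors the earlier snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative prompt; this is the base (non-instruct) checkpoint,
# so continuation-style prompts tend to work best.
prompt = "State space language models differ from transformers in that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```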