Falcon-Mamba-7B-Instruct
| Property | Value |
|---|---|
| Parameter Count | 7.27B |
| Model Type | Causal decoder-only |
| Architecture | Mamba SSM |
| License | TII Falcon-Mamba License 2.0 |
| Paper | arXiv:2410.05355 |
What is falcon-mamba-7b-instruct?
Falcon-Mamba-7B-Instruct is a language model from the Technology Innovation Institute (TII) that departs from the standard transformer design: it is one of the first competitive models at this scale to drop attention mechanisms entirely, relying instead on the Mamba state space model (SSM) architecture for sequence modeling. The base model was trained on approximately 5,500 GT (gigatokens, i.e. roughly 5.5 trillion tokens) of data, drawn mainly from RefinedWeb, and this variant was then fine-tuned for instruction-following tasks.
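As a quick illustration of instruction-following use, here is a minimal generation sketch with the Hugging Face transformers library. The checkpoint id follows the public repository name; the prompt and generation settings are illustrative placeholders, and the sketch assumes the instruct checkpoint ships a chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format a single user instruction with the model's chat template.
messages = [{"role": "user", "content": "Explain state space models in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a short response (sampling settings here are arbitrary examples).
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```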
Implementation Details
The model has 64 layers, a hidden dimension of 4096, and a state dimension of 16. It was trained on 256 H100 80GB GPUs using a 3D parallelism strategy. Training proceeded in multiple stages, with the data mixture for each stage chosen along curriculum-learning lines.
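To check that a downloaded checkpoint matches the dimensions quoted above, one can inspect its configuration. The sketch below assumes Mamba-style field names (num_hidden_layers, hidden_size, state_size) as exposed by transformers; these are an assumption and should be verified against the checkpoint's actual config.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-mamba-7b-instruct")

# Field names assume the Mamba-style config in transformers; adjust if they differ.
print(config.num_hidden_layers)  # expected: 64 layers
print(config.hidden_size)        # expected: 4096 hidden dimension
print(config.state_size)         # expected: 16 state dimension
```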
- Trained using the AdamW optimizer with a maximum learning rate of 6.4e-4
- Uses a WSD (warmup-stable-decay) learning rate schedule (see the sketch after this list)
- Ramps the batch size up during the initial warmup phase (batch scaling)
- Supports context lengths up to 8,192 tokens
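The WSD schedule referenced above can be summarized in a few lines. This is a generic sketch of the warmup-stable-decay shape, not the actual training recipe: only the 6.4e-4 peak learning rate comes from this card, and the phase lengths and floor value are hypothetical placeholders.

```python
def wsd_lr(step: int,
           peak_lr: float = 6.4e-4,      # maximum LR quoted above
           warmup_steps: int = 1_000,    # placeholder, not the real value
           stable_steps: int = 100_000,  # placeholder
           decay_steps: int = 10_000,    # placeholder
           min_lr: float = 6.4e-5) -> float:  # placeholder floor
    """Warmup-Stable-Decay: linear warmup, long constant plateau, linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```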
Core Capabilities
- Achieves competitive performance on standard benchmarks (64.09% average across its reported evaluation suite)
- Excels in tasks like ARC (62.03%) and MMLU (62.11%)
- Supports efficient inference with various precision options, from FP16 down to 4-bit quantization (see the loading sketch after this list)
- Demonstrates strong performance in technical and mathematical reasoning tasks
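As mentioned in the list above, the model can be loaded in reduced precision. Below is a minimal 4-bit loading sketch using bitsandbytes through transformers; the NF4 and compute-dtype settings are illustrative choices, not an official recommendation from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-mamba-7b-instruct"

# Illustrative 4-bit setup; for plain FP16, drop quantization_config and pass
# torch_dtype=torch.float16 instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```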
Frequently Asked Questions
Q: What makes this model unique?
This is one of the first successful implementations of a pure SSM (State Space Model) architecture at scale, achieving comparable performance to transformer models without using attention mechanisms. This makes it particularly interesting for both research and practical applications.
Q: What are the recommended use cases?
The model is particularly well-suited for instruction-following tasks, technical content generation, and general language understanding tasks. It can be deployed in various configurations from full precision to 4-bit quantization, making it adaptable to different computational resources.