FLM-101B
| Property | Value |
|---|---|
| Parameter Count | 101 billion |
| Model Type | Decoder-only LLM |
| Languages | Chinese, English |
| License | Apache-2.0 |
| Context Window | 2,048 tokens |
| Training Cost | ~$100,000 |
What is FLM-101B?
FLM-101B is a 101-billion-parameter decoder-only language model built around a cost-effective model growth strategy: training starts from a 16B-parameter model that is progressively grown to 101B, bringing the total training cost to roughly $100,000. The model uses xPos rotary position embeddings, which allow the context window to be extended at inference time beyond the 2,048 tokens used during training.
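As a rough illustration of the xPos idea (not FLM-101B's actual code), the PyTorch sketch below applies the usual rotary rotation plus a per-dimension exponential decay on queries and keys, which keeps attention scores between distant positions well behaved when running past the training window. The head dimension, `gamma = 0.4`, and `scale_base = 512` are assumed defaults from the xPos literature, not values taken from this model.

```python
# Minimal xPos-style rotary embedding sketch (illustrative only).
import torch

def xpos_scales(head_dim: int, positions: torch.Tensor,
                scale_base: float = 512.0, gamma: float = 0.4) -> torch.Tensor:
    """Per-pair decay bases zeta_i raised to the power n / scale_base."""
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    zeta = (i / head_dim + gamma) / (1.0 + gamma)                       # [head_dim/2]
    return zeta[None, :] ** (positions[:, None].float() / scale_base)   # [seq, head_dim/2]

def rope_angles(head_dim: int, positions: torch.Tensor,
                base: float = 10000.0) -> torch.Tensor:
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return positions[:, None].float() * inv_freq[None, :]               # [seq, head_dim/2]

def apply_xpos(x: torch.Tensor, positions: torch.Tensor,
               downscale: bool = False) -> torch.Tensor:
    """Rotate x of shape [seq, head_dim]; queries use the scale, keys its inverse."""
    head_dim = x.shape[-1]
    angles = rope_angles(head_dim, positions)
    scale = xpos_scales(head_dim, positions)
    if downscale:                               # keys: zeta ** (-n / scale_base)
        scale = 1.0 / scale
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    out1 = (x1 * cos - x2 * sin) * scale
    out2 = (x1 * sin + x2 * cos) * scale
    return torch.stack((out1, out2), dim=-1).flatten(-2)

# Positions well past the 2,048-token training window are still well defined:
pos = torch.arange(4096, 4106)
q = apply_xpos(torch.randn(10, 128), pos)                   # query projection
k = apply_xpos(torch.randn(10, 128), pos, downscale=True)   # key projection
scores = q @ k.t()   # scores decay with relative distance via the zeta terms
```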
Implementation Details
The architecture has 80 layers, 80 attention heads, and a hidden dimension of 10,240. Training ran on a cluster of 24 DGX-A800 GPU servers using 3D parallelism (data, tensor, and pipeline parallelism) together with distributed optimization. Flash Attention is used during training, and the model supports both Chinese and English.
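For orientation, the arithmetic below shows one way a 3D-parallel layout could map onto the 24 × 8 = 192 GPUs of such a cluster. The TP=8 / PP=4 / DP=6 split is an assumption for illustration, not FLM-101B's documented configuration.

```python
# Illustrative arithmetic only: one possible 3D-parallel layout over 192 GPUs.
num_nodes, gpus_per_node = 24, 8
world_size = num_nodes * gpus_per_node                      # 192 GPUs in total

tensor_parallel = 8        # shard each layer's matmuls across one full node
pipeline_parallel = 4      # split the 80 layers into 4 stages of 20 layers
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # 6 replicas

assert data_parallel * tensor_parallel * pipeline_parallel == world_size
print(f"DP={data_parallel} x TP={tensor_parallel} x PP={pipeline_parallel} = {world_size} GPUs")
```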
- Progressive learning with model growth from 16B to 101B (a toy sketch of function-preserving growth follows this list)
- xPos position embedding for efficient context window expansion
- Flash Attention implementation for training optimization
- Customized vocabulary size of 100,256 tokens
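The growth step can be made function-preserving: new parameters are added so that the larger model initially computes exactly what the smaller one did. The toy PyTorch example below shows this for a single MLP's hidden width; it is a generic sketch of the idea, not FLM-101B's actual growth operator, and the layer sizes are arbitrary.

```python
# Toy function-preserving width growth: new hidden units get zero-initialized
# outgoing weights, so the grown network matches the original before training.
import torch
import torch.nn as nn

def grow_mlp_width(mlp: nn.Sequential, new_hidden: int) -> nn.Sequential:
    """Expand the hidden width of a [Linear, activation, Linear] block."""
    fc1, act, fc2 = mlp[0], mlp[1], mlp[2]
    old_hidden = fc1.out_features

    new_fc1 = nn.Linear(fc1.in_features, new_hidden)
    new_fc2 = nn.Linear(new_hidden, fc2.out_features)
    with torch.no_grad():
        # Copy existing weights; new rows of fc1 may be freely initialized...
        new_fc1.weight[:old_hidden] = fc1.weight
        new_fc1.bias[:old_hidden] = fc1.bias
        # ...as long as their outgoing columns in fc2 start at zero.
        new_fc2.weight.zero_()
        new_fc2.weight[:, :old_hidden] = fc2.weight
        new_fc2.bias.copy_(fc2.bias)
    return nn.Sequential(new_fc1, act, new_fc2)

small = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
big = grow_mlp_width(small, new_hidden=128)
x = torch.randn(4, 16)
assert torch.allclose(small(x), big(x), atol=1e-6)  # same function after growth
```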
Core Capabilities
- Bilingual processing (Chinese and English)
- Efficient context window expansion
- Cost-effective training methodology
- Flexible deployment with PyTorch integration
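As a sketch of what PyTorch-based deployment could look like through Hugging Face transformers: the repository id, the need for `trust_remote_code`, and the memory note below are assumptions, so check the official release for the actual identifier and hardware requirements.

```python
# Hypothetical loading and generation example (repo id is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CofeAI/FLM-101B"  # assumed repo id; adjust to the actual release
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision still needs ~200 GB of memory
    device_map="auto",           # shard layers across the available GPUs
    trust_remote_code=True,
)

inputs = tokenizer("北京是中国的", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```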
Frequently Asked Questions
Q: What makes this model unique?
FLM-101B is distinguished by its successful use of progressive learning with model growth at the 100B scale, making it the largest known language model trained with xPos embeddings and μP transfer. Its unusually low training cost demonstrates a practical route to developing large-scale language models.
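μP transfer means tuning hyperparameters on a small proxy model and carrying them over to the full-size model with width-aware scaling rules. The snippet below is a rough sketch of that kind of scaling; the proxy width, learning rate, and the exact rules shown are assumptions, not FLM-101B's published recipe.

```python
# Rough sketch of muP-style hyperparameter transfer (assumed values).
base_width = 256                  # hidden size of a hypothetical proxy model
target_width = 10240              # FLM-101B hidden size
width_mult = target_width / base_width

proxy_adam_lr = 3e-3              # assumed best LR found by sweeping the proxy
hidden_adam_lr = proxy_adam_lr / width_mult     # hidden-matrix Adam LR ~ 1/width
hidden_init_std = target_width ** -0.5          # init std ~ 1/sqrt(fan_in)
output_mult = 1.0 / width_mult                  # scale output logits by 1/width

print(f"hidden LR: {hidden_adam_lr:.2e}, init std: {hidden_init_std:.4f}")
```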
Q: What are the recommended use cases?
The model is suited to Chinese and English language processing, particularly text generation. However, because it was trained on a comparatively small number of tokens, it may fall short in specialized domains, and it currently requires further optimization of inference speed and resource usage.