FLM-101B
| Property | Value |
|---|---|
| Parameter Count | 101 billion |
| Model Type | Decoder-only LLM |
| Languages | Chinese, English |
| License | Apache-2.0 |
| Context Window | 2,048 tokens |
| Training Cost | ~$100,000 |
What is FLM-101B?
FLM-101B is a 101-billion-parameter decoder-only language model built around a cost-effective model growth strategy: training starts from a 16B-parameter model that is progressively grown to 101B, bringing the total training cost to roughly $100,000. The model uses xPos rotary position embeddings, which allow the context window to be extended at inference time beyond the 2,048 tokens used during training.
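As a rough illustration of the xPos idea (not FLM-101B's actual code), the PyTorch sketch below applies the usual rotary rotation plus a per-dimension exponential decay on queries and keys, which keeps attention scores between distant positions well behaved when running past the training window. The head dimension, `gamma = 0.4`, and `scale_base = 512` are assumed defaults from the xPos literature, not values taken from this model.

```python
# Minimal xPos-style rotary embedding sketch (illustrative only).
import torch

def xpos_scales(head_dim: int, positions: torch.Tensor,
                scale_base: float = 512.0, gamma: float = 0.4) -> torch.Tensor:
    """Per-pair decay bases zeta_i raised to the power n / scale_base."""
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    zeta = (i / head_dim + gamma) / (1.0 + gamma)                       # [head_dim/2]
    return zeta[None, :] ** (positions[:, None].float() / scale_base)   # [seq, head_dim/2]

def rope_angles(head_dim: int, positions: torch.Tensor,
                base: float = 10000.0) -> torch.Tensor:
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return positions[:, None].float() * inv_freq[None, :]               # [seq, head_dim/2]

def apply_xpos(x: torch.Tensor, positions: torch.Tensor,
               downscale: bool = False) -> torch.Tensor:
    """Rotate x of shape [seq, head_dim]; queries use the scale, keys its inverse."""
    head_dim = x.shape[-1]
    angles = rope_angles(head_dim, positions)
    scale = xpos_scales(head_dim, positions)
    if downscale:                               # keys: zeta ** (-n / scale_base)
        scale = 1.0 / scale
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    out1 = (x1 * cos - x2 * sin) * scale
    out2 = (x1 * sin + x2 * cos) * scale
    return torch.stack((out1, out2), dim=-1).flatten(-2)

# Positions well past the 2,048-token training window are still well defined:
pos = torch.arange(4096, 4106)
q = apply_xpos(torch.randn(10, 128), pos)                   # query projection
k = apply_xpos(torch.randn(10, 128), pos, downscale=True)   # key projection
scores = q @ k.t()   # scores decay with relative distance via the zeta terms
```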
Implementation Details
The architecture has 80 layers, 80 attention heads, and a hidden dimension of 10,240. Training ran on a cluster of 24 DGX-A800 GPU servers using 3D parallelism (data, tensor, and pipeline parallelism) together with distributed optimization. Flash Attention is used during training, and the model supports both Chinese and English.
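For orientation, the arithmetic below shows one way a 3D-parallel layout could map onto the 24 × 8 = 192 GPUs of such a cluster. The TP=8 / PP=4 / DP=6 split is an assumption for illustration, not FLM-101B's documented configuration.

```python
# Illustrative arithmetic only: one possible 3D-parallel layout over 192 GPUs.
num_nodes, gpus_per_node = 24, 8
world_size = num_nodes * gpus_per_node                      # 192 GPUs in total

tensor_parallel = 8        # shard each layer's matmuls across one full node
pipeline_parallel = 4      # split the 80 layers into 4 stages of 20 layers
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # 6 replicas

assert data_parallel * tensor_parallel * pipeline_parallel == world_size
print(f"DP={data_parallel} x TP={tensor_parallel} x PP={pipeline_parallel} = {world_size} GPUs")
```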
- Progressive learning with model growth from 16B to 101B (a toy sketch of function-preserving growth follows this list)
- xPos position embedding for efficient context window expansion
- Flash Attention implementation for training optimization
- Customized vocabulary size of 100,256 tokens
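The growth step can be made function-preserving: new parameters are added so that the larger model initially computes exactly what the smaller one did. The toy PyTorch example below shows this for a single MLP's hidden width; it is a generic sketch of the idea, not FLM-101B's actual growth operator, and the layer sizes are arbitrary.

```python
# Toy function-preserving width growth: new hidden units get zero-initialized
# outgoing weights, so the grown network matches the original before training.
import torch
import torch.nn as nn

def grow_mlp_width(mlp: nn.Sequential, new_hidden: int) -> nn.Sequential:
    """Expand the hidden width of a [Linear, activation, Linear] block."""
    fc1, act, fc2 = mlp[0], mlp[1], mlp[2]
    old_hidden = fc1.out_features

    new_fc1 = nn.Linear(fc1.in_features, new_hidden)
    new_fc2 = nn.Linear(new_hidden, fc2.out_features)
    with torch.no_grad():
        # Copy existing weights; new rows of fc1 may be freely initialized...
        new_fc1.weight[:old_hidden] = fc1.weight
        new_fc1.bias[:old_hidden] = fc1.bias
        # ...as long as their outgoing columns in fc2 start at zero.
        new_fc2.weight.zero_()
        new_fc2.weight[:, :old_hidden] = fc2.weight
        new_fc2.bias.copy_(fc2.bias)
    return nn.Sequential(new_fc1, act, new_fc2)

small = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
big = grow_mlp_width(small, new_hidden=128)
x = torch.randn(4, 16)
assert torch.allclose(small(x), big(x), atol=1e-6)  # same function after growth
```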
Core Capabilities
- Bilingual processing (Chinese and English)
- Efficient context window expansion
- Cost-effective training methodology
- Flexible deployment with PyTorch integration
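As a sketch of what PyTorch-based deployment could look like through Hugging Face transformers: the repository id, the need for `trust_remote_code`, and the memory note below are assumptions, so check the official release for the actual identifier and hardware requirements.

```python
# Hypothetical loading and generation example (repo id is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CofeAI/FLM-101B"  # assumed repo id; adjust to the actual release
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision still needs ~200 GB of memory
    device_map="auto",           # shard layers across the available GPUs
    trust_remote_code=True,
)

inputs = tokenizer("北京是中国的", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```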
Frequently Asked Questions
Q: What makes this model unique?
FLM-101B is distinguished by its successful use of progressive learning with model growth at the 100B scale, making it the largest known language model trained with xPos embeddings and μP transfer. Its unusually low training cost demonstrates a practical route to developing large-scale language models.
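μP transfer means tuning hyperparameters on a small proxy model and carrying them over to the full-size model with width-aware scaling rules. The snippet below is a rough sketch of that kind of scaling; the proxy width, learning rate, and the exact rules shown are assumptions, not FLM-101B's published recipe.

```python
# Rough sketch of muP-style hyperparameter transfer (assumed values).
base_width = 256                  # hidden size of a hypothetical proxy model
target_width = 10240              # FLM-101B hidden size
width_mult = target_width / base_width

proxy_adam_lr = 3e-3              # assumed best LR found by sweeping the proxy
hidden_adam_lr = proxy_adam_lr / width_mult     # hidden-matrix Adam LR ~ 1/width
hidden_init_std = target_width ** -0.5          # init std ~ 1/sqrt(fan_in)
output_mult = 1.0 / width_mult                  # scale output logits by 1/width

print(f"hidden LR: {hidden_adam_lr:.2e}, init std: {hidden_init_std:.4f}")
```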
Q: What are the recommended use cases?
The model is suited to Chinese and English language processing, particularly text generation. However, because it was trained on a comparatively small number of tokens, it may fall short in specialized domains, and it currently requires further optimization of inference speed and resource usage.