mpt-1b-redpajama-200b

mosaicml

MPT-1b-RedPajama-200b is a 1.3B-parameter decoder-only transformer trained on the RedPajama dataset for 200B tokens, using features such as FlashAttention and ALiBi.

Property                  Value
Parameter Count           1.3 Billion
License                   Apache 2.0
Release Date              April 20, 2023
Training Infrastructure   440 A100-40GB GPUs
Architecture              24 layers, 16 attention heads, width 2048

What is mpt-1b-redpajama-200b?

MPT-1b-RedPajama-200b is a decoder-only transformer model developed by MosaicML and trained on the RedPajama dataset. The model was trained for 200B tokens on a carefully curated mix of data sources that mirrors the mix used to train the Llama series of models.

Implementation Details

The model uses the MosaicML LLM codebase and incorporates several modifications that set it apart from standard transformer architectures (a minimal loading sketch follows the list below).

  • Employs ALiBi positional encoding instead of traditional positional embeddings
  • Implements QK LayerNorm for enhanced stability
  • Omits the bias parameters found in standard transformer layers
  • Supports FlashAttention with Triton implementation for optimization
  • Uses the EleutherAI/gpt-neox-20b tokenizer
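The snippet below is a minimal loading sketch using Hugging Face transformers. It assumes the Hub id mosaicml/mpt-1b-redpajama-200b and an attn_impl field exposed by the model's remote code; both follow MosaicML release conventions rather than anything confirmed on this page, so check the official model card before relying on them.

```python
# Minimal loading sketch (assumptions: Hub id "mosaicml/mpt-1b-redpajama-200b"
# and the `attn_impl` config field from the model's remote code).
import transformers

MODEL_ID = "mosaicml/mpt-1b-redpajama-200b"

# The custom architecture is not a stock transformers class, so the
# model's own code must be downloaded and trusted.
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# Optionally switch attention to the Triton FlashAttention kernel (GPU only).
config = transformers.AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
config.attn_impl = "triton"  # assumed field name; verify against the model card
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    trust_remote_code=True,
)
```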

Core Capabilities

  • Efficient text generation with optimized attention mechanisms
  • Handles diverse content types due to varied training data (CommonCrawl, GitHub, Wikipedia, etc.)
  • Supports both CPU and GPU inference with bfloat16 optimization (see the generation sketch after this list)
  • Scalable deployment with FSDP sharding support
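To make the bfloat16 inference path concrete, here is a minimal generation sketch. It pairs the EleutherAI/gpt-neox-20b tokenizer named above with the assumed Hub id; the prompt and sampling parameters are illustrative only.

```python
# Minimal generation sketch (assumes the Hub id used above; the tokenizer
# id comes from the model card).
import torch
import transformers

MODEL_ID = "mosaicml/mpt-1b-redpajama-200b"

tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    trust_remote_code=True,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("The RedPajama dataset is", return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```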

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness stems from its efficient architecture combining ALiBi, QK LayerNorm, and FlashAttention, trained on a carefully balanced dataset mix matching the Llama training distribution.
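To make the ALiBi point concrete, the sketch below (an illustration of the technique, not MosaicML's implementation) builds the additive attention bias: rather than adding positional embeddings to the inputs, each head penalizes attention logits in proportion to query-key distance.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the ALiBi additive attention bias (n_heads a power of two)."""
    # Head-specific geometric slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    start = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])
    # distance[i, j] = j - i, clamped to 0 for future (causally masked) keys.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    # Shape (n_heads, seq_len, seq_len); added to attention logits before
    # the softmax, so distant keys are linearly penalized.
    return slopes[:, None, None] * distance[None, :, :]

# For this model's configuration: 16 attention heads.
bias = alibi_bias(n_heads=16, seq_len=8)
```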

Q: What are the recommended use cases?

The model is well-suited for general text generation, research applications, and scenarios requiring efficient transformer-based language processing. It is particularly efficient when run with the Triton FlashAttention implementation.
