# PixArt-XL-2-1024-MS

| Property | Value |
|---|---|
| License | OpenRAIL++ |
| Parameters | 0.6B |
| Training Data | 25M images |
| Training Cost | 675 A100 GPU days |
| Paper | arXiv:2310.00426 |
## What is PixArt-XL-2-1024-MS?

PixArt-XL-2-1024-MS is a text-to-image diffusion transformer that pairs high-quality image generation with an unusually low training budget. It generates 1024px images directly from text prompts in a single sampling process, using pure transformer blocks as the latent diffusion backbone.
## Implementation Details

The model uses T5 for text encoding and a VAE for latent feature encoding. It is implemented in the diffusers library and can be accelerated with torch.compile for 20-30% faster inference on torch >= 2.0, as in the sketch below. It reaches its results with far less compute than comparable models, requiring only 675 A100 GPU days versus SD 1.5's 6,250.
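A minimal loading-and-generation sketch with diffusers follows. It assumes the Hugging Face Hub id `PixArt-alpha/PixArt-XL-2-1024-MS` and a CUDA device; the torch.compile line is optional and only helps on torch >= 2.0.

```python
import torch
from diffusers import PixArtAlphaPipeline

# Load the pipeline in fp16 (the Hub model id here is an assumption).
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Optional: compile the transformer for ~20-30% faster inference on torch >= 2.0.
pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)

# A single sampling pass produces a 1024px image directly from the prompt.
image = pipe(prompt="A small cactus wearing a straw hat in the desert").images[0]
image.save("cactus.png")
```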
- Efficient architecture requiring only 0.6B parameters
- Supports high-resolution 1024px image generation
- Compatible with various sampling methods, including SA-Solver (see the sketch after this list)
- Includes CPU offloading for limited-VRAM scenarios (also shown below)
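A hedged sketch of both options, assuming a diffusers release that ships `SASolverScheduler`:

```python
import torch
from diffusers import PixArtAlphaPipeline, SASolverScheduler

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)

# Swap in SA-Solver (assumes a diffusers version that includes SASolverScheduler).
pipe.scheduler = SASolverScheduler.from_config(pipe.scheduler.config)

# Move submodules to the GPU only while they run, trading speed for lower VRAM;
# skip .to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()

image = pipe(prompt="An astronaut riding a horse, detailed watercolor").images[0]
```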
## Core Capabilities
- Direct generation of 1024px images from text
- Comparable or better quality than SDXL 0.9 and DALL-E 2 in user studies
- Roughly 90% reduction in training cost and CO2 emissions compared to SD 1.5
- Efficient memory usage with various optimization options
## Frequently Asked Questions
Q: What makes this model unique?
The model's primary distinction is its efficiency: it achieves competitive results with only 10.8% of Stable Diffusion v1.5's training resources (675 vs. 6,250 A100 GPU days) while maintaining comparable or better output quality.
Q: What are the recommended use cases?
The model is intended for research purposes, particularly in areas such as artwork generation, educational tools, generative model research, and studying AI safety. It's not intended for generating factual content or true representations of people or events.