MicroDiT

Maintained by: VSehwag24


Property          Value
Parameter Count   1.16 billion
Training Cost     $1,890
License           Apache 2.0
Paper             arXiv:2407.15811

What is MicroDiT?

MicroDiT is a text-to-image diffusion transformer built to show that competitive image generation does not require a massive compute budget. It was developed with cost efficiency as the primary goal: the 1.16-billion-parameter model was trained for about $1,890, a fraction of the budget used for comparable models, while remaining competitive in generation quality.

Implementation Details

The model combines several techniques to keep training cost low:

  • Random masking of up to 75% of image patches during training
  • Deferred masking: a lightweight patch-mixer processes all patches before masking is applied
  • Mixture-of-experts layers for improved performance
  • Training pipeline progressing from 256×256 to 512×512 resolution
  • Total training time of 2.6 days on 8×H100 GPUs
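
The masking steps listed above can be illustrated with a toy sketch. This is not the authors' implementation: patches are reduced to single floats, `patch_mixer` is a hypothetical stand-in (a simple blend with the global mean) for the learned patch-mixer, and the function names are invented for this example. It only shows the ordering that makes the masking "deferred": mixing runs on all patches first, and the random 75% mask is applied afterwards.

```python
import random


def split_patches(num_patches, mask_ratio=0.75, seed=None):
    """Randomly partition patch indices into (kept, masked) index lists."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


def patch_mixer(patches):
    """Hypothetical stand-in for the learned patch-mixer.

    Blends every patch with the global mean so that each surviving
    patch carries some information about the patches that get dropped.
    """
    mean = sum(patches) / len(patches)
    return [0.5 * p + 0.5 * mean for p in patches]


def deferred_masking(patches, mask_ratio=0.75, seed=None):
    """Mix all patches first, then drop a random subset of them."""
    mixed = patch_mixer(patches)  # runs on ALL patches, before masking
    kept, masked = split_patches(len(mixed), mask_ratio, seed)
    visible = [mixed[i] for i in kept]  # only these reach the backbone
    return visible, kept, masked


# Toy input: 16 "patches", each represented by a single float.
patches = [float(i) for i in range(16)]
visible, kept, masked = deferred_masking(patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 4 12
```

With a 75% mask ratio, the expensive transformer backbone sees only a quarter of the patches per image, which is where the compute saving in this scheme comes from.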

Core Capabilities

  • Zero-shot generation with an FID score of 12.7 on the COCO dataset
  • Generation in multiple styles, including origami, pixel art, line art, and cyberpunk
  • Four pre-trained model variants with different training data configurations
  • Efficient image generation at 512×512 resolution

Frequently Asked Questions

Q: What makes this model unique?

MicroDiT achieves performance comparable to larger models at 118× lower training cost than Stable Diffusion models and 14× lower cost than current state-of-the-art approaches. The savings come from its patch-masking strategy and efficient architecture design.
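
As a back-of-envelope check, the figures quoted on this page can be combined in a few lines of arithmetic. The sketch below assumes that the $1,890 training cost corresponds entirely to the 2.6-day run on 8 H100 GPUs, and it derives baseline budgets purely from the rounded 118× and 14× ratios. The resulting numbers are implied estimates, not reported values, and the function names are invented for this example.

```python
# Figures quoted on this page.
TRAINING_COST_USD = 1890
TRAINING_DAYS = 2.6
NUM_GPUS = 8
COST_RATIO_STABLE_DIFFUSION = 118
COST_RATIO_STATE_OF_THE_ART = 14


def gpu_hours(days, gpus):
    """Total GPU-hours for a run of `days` days on `gpus` GPUs."""
    return days * 24 * gpus


def implied_baseline_cost(cost_usd, ratio):
    """Baseline budget implied by an 'N times lower cost' claim."""
    return cost_usd * ratio


hours = gpu_hours(TRAINING_DAYS, NUM_GPUS)
rate = TRAINING_COST_USD / hours  # assumed: whole cost is GPU rental

print(f"{hours:.1f} GPU-hours")       # 499.2 GPU-hours
print(f"${rate:.2f} per GPU-hour")    # $3.79 per GPU-hour
print(implied_baseline_cost(TRAINING_COST_USD, COST_RATIO_STABLE_DIFFUSION))  # 223020
print(implied_baseline_cost(TRAINING_COST_USD, COST_RATIO_STATE_OF_THE_ART))  # 26460
```

Because the ratios are rounded, the implied baseline budgets are only accurate to within a few percent of whatever the original comparisons used.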

Q: What are the recommended use cases?

The model is well suited to text-to-image generation, especially when compute resources are limited. It can generate images in a range of styles and is effective for both real and synthetic image generation tasks.
