AltDiffusion-m9
| Property | Value |
|---|---|
| License | CreativeML OpenRAIL-M |
| Architecture | Stable Diffusion-based |
| Total Parameters | ~1.8B (859M TextEncoder + 865M Unet + 83.7M AutoEncoder) |
| Research Paper | AltDiffusion Paper |
| Supported Languages | English, Chinese, Spanish, French, Russian, Japanese, Korean, Arabic, Italian |
What is AltDiffusion-m9?
AltDiffusion-m9 is a multilingual text-to-image diffusion model that extends Stable Diffusion to nine languages. Built by BAAI, it uses AltCLIP-m9 as its text encoder, which lets it accept prompts in any of the supported languages while maintaining high-quality image generation.
Implementation Details
The model has three main components: an AutoEncoder (83.7M parameters), a Unet (865M parameters), and the AltCLIP-m9 TextEncoder (859M parameters). Inference requires at least 10 GB of GPU memory, and the model supports several sampling methods, including DDIM and DPM-Solver.
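As a quick sanity check on the figures above, the sketch below sums the three component sizes and estimates the weights-only footprint at two precisions. The byte sizes (4 bytes per parameter for fp32, 2 for fp16) are standard, but the helper names are illustrative, and the estimate deliberately ignores activations, intermediate latents, and framework overhead, so it is a lower bound rather than an explanation of the quoted memory requirement.

```python
# Component sizes in millions of parameters, as quoted in the text above.
COMPONENT_PARAMS_M = {
    "text_encoder (AltCLIP-m9)": 859.0,
    "unet": 865.0,
    "autoencoder": 83.7,
}


def total_params_millions(components):
    """Sum the per-component parameter counts (in millions)."""
    return sum(components.values())


def weights_size_gb(params_millions, bytes_per_param):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


total_m = total_params_millions(COMPONENT_PARAMS_M)
print(f"total parameters: {total_m:.1f}M")                    # 1807.7M, i.e. ~1.8B
print(f"fp32 weights: {weights_size_gb(total_m, 4):.2f} GB")  # 7.23 GB
print(f"fp16 weights: {weights_size_gb(total_m, 2):.2f} GB")  # 3.62 GB
```

The total of 1807.7M matches the rounded ~1.8B figure. Loading the weights in half precision roughly halves the weights-only footprint, which is why fp16 is a common choice when GPU memory is tight.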
- Built on Stable Diffusion architecture with multilingual enhancements
- Trained on WuDao and LAION datasets
- Includes a fast DPM scheduler for efficient generation
- Supports both Diffusers and Transformers pipelines
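The Diffusers route with the DPM scheduler can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' reference code: the Hub model ID `BAAI/AltDiffusion-m9` is assumed, the helper names `build_pipeline` and `generate_all` are hypothetical, the step count and guidance scale are illustrative defaults, and the import path for `AltDiffusionPipeline` may depend on the installed `diffusers` version.

```python
# The same illustrative prompt in four of the supported languages.
PROMPTS = {
    "English": "a cat sitting on a sofa",
    "Chinese": "一只猫坐在沙发上",
    "Spanish": "un gato sentado en un sofá",
    "French": "un chat assis sur un canapé",
}


def build_pipeline(model_id="BAAI/AltDiffusion-m9", device="cuda"):
    """Load the pipeline in fp16 and swap in a DPM-Solver scheduler.

    The model ID is an assumption. Imports are deferred so the rest of
    this module can be used without torch or diffusers installed.
    """
    import torch
    from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler

    pipe = AltDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    # Reuse the existing scheduler config so the noise schedule stays consistent.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe.to(device)


def generate_all(pipe, prompts, steps=25, guidance_scale=7.5):
    """Run the pipeline once per prompt and return {language: image}."""
    images = {}
    for language, prompt in prompts.items():
        result = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
        images[language] = result.images[0]
    return images


# Usage (needs a GPU plus the torch and diffusers packages):
#   pipe = build_pipeline()
#   images = generate_all(pipe, PROMPTS)
#   images["Chinese"].save("cat_zh.png")
```

Because the text encoder is multilingual, the prompts are passed through unchanged; no language flag or translation step is involved in this sketch. DPM-Solver schedulers typically need far fewer steps than DDIM for comparable results, which is the reason for the low default step count here.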
Core Capabilities
- Multilingual text-to-image generation in 9 languages
- Enhanced cross-lingual alignment capabilities
- Improved long image generation
- Maintains original Stable Diffusion capabilities while adding multilingual support
- Generation time of about 2 seconds on a V100 GPU with the fast DPM scheduler
Frequently Asked Questions
Q: What makes this model unique?
AltDiffusion-m9's main distinction is multilingual prompting: it generates images from prompts in nine languages while keeping output quality comparable to or better than the original Stable Diffusion model.
Q: What are the recommended use cases?
The model suits multilingual creative applications such as concept art generation, character design, and general image creation, particularly where prompts may be written in different languages.