Taiyi-Stable-Diffusion-XL-3.5B

IDEA-CCNL

Bilingual text-to-image diffusion model with 3.5B parameters that accepts both Chinese and English prompts. It pairs the Stable Diffusion XL architecture with a CLIP-based text encoder extended for bilingual input.

Property	Value
License	Apache 2.0
Paper	arXiv:2401.14688
Language Support	English, Chinese (Bilingual)
Framework	Diffusers

What is Taiyi-Stable-Diffusion-XL-3.5B?

Taiyi-Stable-Diffusion-XL-3.5B is a bilingual text-to-image generation model that builds on Stable Diffusion XL and strengthens its handling of Chinese. It accepts prompts in either English or Chinese and targets comparable output quality in both languages.

Implementation Details

The model uses a three-stage training process built around a CLIP text encoder whose vocabulary and position encoding have been expanded to cover bilingual input. It is based on the Stable-Diffusion-XL architecture and trained on high-quality image-text pairs whose detailed descriptive captions were generated by vision-language models.

Multi-resolution and multi-aspect ratio training pipeline
Enhanced CLIP-based text encoder with bilingual capabilities
Memory-efficient training approach with contrastive loss function
Support for both Chinese and English text prompts
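
The multi-resolution, multi-aspect-ratio item above is commonly implemented with aspect-ratio bucketing: each training image is assigned to the predefined resolution whose shape is closest to its own, and batches are drawn from a single bucket. The sketch below illustrates that general idea only. The bucket list and the function names `nearest_bucket` and `group_by_bucket` are assumptions made for this example and are not taken from the Taiyi-XL training recipe.

```python
import math

# Hypothetical (width, height) buckets with roughly equal pixel counts.
# These values are illustrative, not the ones used to train the model.
BUCKETS = [
    (1024, 1024),
    (1152, 896), (896, 1152),
    (1216, 832), (832, 1216),
    (1344, 768), (768, 1344),
]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to width/height.

    Ratios are compared in log space so that 2:1 and 1:2 are treated
    as equally far from a square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


def group_by_bucket(sizes, buckets=BUCKETS):
    """Map each bucket to the indices of the images assigned to it."""
    groups = {}
    for index, (width, height) in enumerate(sizes):
        groups.setdefault(nearest_bucket(width, height, buckets), []).append(index)
    return groups
```

For instance, a 1920x1080 image lands in the (1344, 768) bucket under this list, and every batch built from one group can be resized to a single shape without heavy cropping.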

Core Capabilities

Bilingual text-to-image generation from English or Chinese prompts
Reported CLIP similarity scores of 0.254 for English and 0.225 for Chinese prompts
Improved (lower) FID scores compared to previous models
Photorealistic image generation capabilities
Support for various artistic styles and compositions
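
A CLIP similarity score of the kind listed above is typically the cosine similarity between an image embedding and the embedding of its prompt, averaged over an evaluation set. The sketch below shows only that arithmetic on plain Python lists. It assumes the embeddings were already produced by some CLIP image and text encoder. The source does not state which CLIP model or evaluation protocol produced the 0.254 and 0.225 figures, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_similarity(image_embeddings, text_embeddings):
    """Average similarity over matched (image, prompt) embedding pairs."""
    if len(image_embeddings) != len(text_embeddings) or not image_embeddings:
        raise ValueError("need the same non-zero number of image and text embeddings")
    scores = [cosine_similarity(i, t) for i, t in zip(image_embeddings, text_embeddings)]
    return sum(scores) / len(scores)
```

Because the score depends on the encoder used to compute the embeddings, values are only comparable between models evaluated with the same CLIP checkpoint and prompt set.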

Frequently Asked Questions

Q: What makes this model unique?

The model's distinguishing feature is that it handles English and Chinese prompts in a single checkpoint. It is reported to outperform existing open-source alternatives in both English and Chinese text-to-image generation while maintaining high image quality and accurate prompt following.

Q: What are the recommended use cases?

The model suits applications that need high-quality image generation from English or Chinese text prompts, including digital art creation, content generation, and visual design. It is particularly effective for photographic-style outputs, and generation can be accelerated with Latent Consistency Models (LCM).
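
Since the card lists Diffusers as the framework, a generation call might look like the minimal sketch below. Several things here are assumptions rather than facts from the source: the repository id `IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B`, that the checkpoint loads through `DiffusionPipeline.from_pretrained` without extra steps, and the default step count and guidance scale, which are placeholders rather than recommended settings. The helper names `build_call_kwargs` and `generate` are hypothetical.

```python
# Assumed Hugging Face repository id; confirm it before use.
MODEL_ID = "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B"


def build_call_kwargs(prompt, negative_prompt="", steps=30,
                      guidance_scale=7.0, width=1024, height=1024):
    """Validate inputs and assemble keyword arguments for an SDXL-style pipeline call."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "width": width,
        "height": height,
    }


def generate(prompt, device="cuda", **overrides):
    """Load the pipeline and return one PIL image for the prompt."""
    # Imported lazily so the helper above works without torch or diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    return pipe(**build_call_kwargs(prompt, **overrides)).images[0]
```

With a GPU available, a Chinese prompt can be passed directly, for example `generate("一只戴着帽子的猫，摄影风格", steps=25).save("cat.png")`. An English prompt goes through the same call.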