# sd-vae-ft-mse

| Property | Value |
|---|---|
| License | MIT |
| Author | StabilityAI |
| Training Steps | 840,001 |
| Framework | Diffusers |
## What is sd-vae-ft-mse?
The sd-vae-ft-mse is an improved variational autoencoder (VAE) designed as a replacement for the original autoencoder in Stable Diffusion models. It was fine-tuned on a combination of the LAION-Aesthetics and LAION-Humans datasets with an emphasis on MSE (Mean Squared Error) loss, yielding superior image reconstruction quality.
## Implementation Details
This model was developed through a two-stage training process: it was first trained for 313,198 steps as ft-EMA, then continued for another 280,000 steps with a modified loss that emphasizes MSE reconstruction (MSE + 0.1 * LPIPS). Training used 16 A100 GPUs with a batch size of 12 per GPU, for an effective batch size of 192 samples.
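The weighted objective from the second stage can be sketched as follows. This is a minimal illustration of the loss weighting, not the training code: `perceptual_distance` here is a stand-in value for the output of a learned LPIPS network (e.g. the `lpips` package), which is out of scope for this sketch.

```python
import numpy as np

def mse(recon, target):
    """Pixel-wise mean squared error between two images in [0, 1]."""
    return float(np.mean((recon - target) ** 2))

def ft_mse_loss(recon, target, perceptual_distance):
    """Second-stage reconstruction loss described above: MSE + 0.1 * LPIPS.

    `perceptual_distance` stands in for the scalar output of an LPIPS
    network evaluated on (recon, target); a real implementation would
    compute it with a pretrained perceptual model.
    """
    return mse(recon, target) + 0.1 * perceptual_distance
```

The heavy MSE weighting (relative to the 0.1 LPIPS term) is what biases the decoder toward smoother, lower-error reconstructions.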
- Trained on a 1:1 ratio of LAION-Aesthetics and LAION-Humans datasets
- Implements EMA (Exponential Moving Average) weights
- Focuses on smoother output generation
- Compatible as a drop-in replacement for existing autoencoders
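Because the model is a drop-in replacement, swapping it into a Diffusers pipeline is a one-line change. The sketch below loads the VAE and passes it to a Stable Diffusion pipeline; `runwayml/stable-diffusion-v1-5` is an assumed base model (any SD 1.x pipeline works), and imports are done lazily inside the function since running it requires `diffusers`, `torch`, and a network connection to download weights.

```python
def load_pipeline_with_ft_mse_vae(device="cuda"):
    """Build a Stable Diffusion pipeline with the ft-MSE VAE swapped in.

    Imports are deferred so the sketch can be read/defined without the
    heavy dependencies installed; calling it needs diffusers + torch
    and downloads the model weights.
    """
    from diffusers import AutoencoderKL, StableDiffusionPipeline

    # Load the fine-tuned VAE on its own...
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

    # ...then hand it to the pipeline in place of the default autoencoder.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed base model, any SD 1.x
        vae=vae,
    )
    return pipe.to(device)
```

No other pipeline changes are needed; the VAE only affects how latents are encoded and decoded, so prompts, schedulers, and checkpoints are untouched.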
## Core Capabilities
- Improved PSNR scores (24.5 ±3.7 on COCO2017)
- Enhanced SSIM metrics (0.71 ±0.13)
- Better face reconstruction quality
- Smoother overall image outputs
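For context on the PSNR figure above, here is how PSNR is commonly computed for images normalized to [0, 1]; this is a generic sketch of the metric, not the evaluation script used for the COCO2017 numbers.

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better).

    PSNR = 10 * log10(max_val^2 / MSE); identical images give infinity.
    """
    err = np.mean((reference - reconstruction) ** 2)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val**2 / err))
```

An MSE of 0.01 on [0, 1] images corresponds to 20 dB, so the reported 24.5 dB implies a noticeably lower average reconstruction error.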
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its specialized fine-tuning approach that emphasizes MSE loss, resulting in smoother and more accurate image reconstructions, particularly for human faces and detailed imagery.
**Q: What are the recommended use cases?**
The model is best suited for Stable Diffusion pipelines where high-quality image reconstruction is crucial, especially when working with human subjects or detailed scenes requiring precise detail preservation.