sd-vae-ft-mse-original

Property	Value
License	MIT
Training Steps	840,001
Model Type	Variational Autoencoder (VAE)
Author	StabilityAI

What is sd-vae-ft-mse-original?

sd-vae-ft-mse-original is a fine-tuned variational autoencoder (VAE) for Stable Diffusion. It improves on the original VAE through MSE-focused fine-tuning on a combination of the LAION-Aesthetics and LAION-Humans datasets. It was trained for 840,001 steps with a modified loss function (MSE + 0.1 * LPIPS) that emphasizes MSE reconstruction.
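The weighting in that loss can be illustrated with a minimal sketch. Only the formula MSE + 0.1 * LPIPS comes from the description above; the function names are hypothetical, and `toy_perceptual` is a stand-in for a real LPIPS network, which would normally compare deep features rather than raw pixels.

```python
from typing import Callable, Sequence

Image = Sequence[float]  # flattened pixel values; both images share one scale


def mse(target: Image, recon: Image) -> float:
    """Mean squared error between two equally sized, non-empty images."""
    if len(target) != len(recon) or not target:
        raise ValueError("images must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(target, recon)) / len(target)


def toy_perceptual(target: Image, recon: Image) -> float:
    """Stand-in for LPIPS (assumption): mean absolute difference."""
    return sum(abs(a - b) for a, b in zip(target, recon)) / len(target)


def reconstruction_loss(
    target: Image,
    recon: Image,
    perceptual: Callable[[Image, Image], float] = toy_perceptual,
    lpips_weight: float = 0.1,  # the 0.1 factor from MSE + 0.1 * LPIPS
) -> float:
    """Combine pixel-level error with a down-weighted perceptual term."""
    return mse(target, recon) + lpips_weight * perceptual(target, recon)
```

With the perceptual term scaled by 0.1, the pixel-level MSE term carries most of the weight, which is consistent with the "MSE-focused" description.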

Implementation Details

The model was developed in two stages: it was first fine-tuned from the original kl-f8 autoencoder using EMA weights, then refined further with MSE-focused training. Training used 16 A100 GPUs with a batch size of 12 per GPU, for a total batch size of 192 (16 × 12).
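For readers unfamiliar with EMA weights, the idea can be sketched as follows. This is a generic illustration, not the training code used for this model: the function name, the plain-dict weight representation, and the decay value of 0.999 are all assumptions.

```python
from typing import Dict


def ema_update(
    ema_weights: Dict[str, float],
    model_weights: Dict[str, float],
    decay: float = 0.999,  # assumed value; the source does not state the decay
) -> None:
    """Update the EMA copy in place after one optimizer step.

    Each EMA weight moves a small fraction (1 - decay) toward the
    current model weight, which smooths out step-to-step noise.
    """
    for name, value in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
```

The averaged copy, rather than the raw weights from the last step, is what gets saved and used for inference.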

Trained on a 1:1 ratio of LAION-Aesthetics and LAION-Humans datasets
Uses EMA (Exponential Moving Average) weights
Reaches a PSNR of 24.5 ± 3.7 on COCO 2017, an improvement over the original VAE
Reaches an SSIM of 0.71 ± 0.13, also an improvement over the original VAE
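As a reminder of what the PSNR figure measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The 8-bit pixel range (MAX = 255) and the reading of "±" as a standard deviation across evaluation images are assumptions; the source does not spell out either.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple


def psnr(
    reference: Sequence[float],
    reconstruction: Sequence[float],
    max_value: float = 255.0,  # assumed 8-bit pixel range
) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("images must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and (population) standard deviation over per-image scores."""
    return mean(scores), pstdev(scores)
```

Because PSNR is logarithmic in the MSE, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.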

Core Capabilities

Produces smoother image outputs compared to previous versions
Improved face reconstruction quality
Better overall image reconstruction metrics
Drop-in replacement for the VAE in existing Stable Diffusion models

Frequently Asked Questions

Q: What makes this model unique?

Its distinguishing feature is the MSE-focused training approach, which improves reconstruction quality, particularly for human faces, as well as general image fidelity. Combining the MSE and LPIPS losses produces notably smoother outputs while preserving detail.

Q: What are the recommended use cases?

The model is particularly well-suited for applications requiring high-quality image reconstruction, especially those involving human subjects. It's designed as a drop-in replacement for the original Stable Diffusion VAE, making it ideal for enhancing existing Stable Diffusion pipelines.
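What "drop-in replacement" means at the checkpoint level can be sketched with plain dictionaries. This is an illustration under stated assumptions, not an official procedure: it assumes the Stable Diffusion checkpoint stores its VAE weights under a `first_stage_model.` key prefix and that the VAE checkpoint may carry training-only entries under `loss.` or `model_ema.`. The function name `swap_vae` is hypothetical, and loading the real checkpoint files is left out.

```python
from typing import Any, Dict

VAE_PREFIX = "first_stage_model."            # assumed key prefix for VAE weights
SKIPPED_PREFIXES = ("loss.", "model_ema.")   # assumed training-only entries


def swap_vae(model_state: Dict[str, Any], vae_state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of model_state with its VAE weights replaced.

    Only keys that already exist in the model are overwritten, so the
    replacement fails loudly if the two architectures do not line up.
    """
    merged = dict(model_state)
    for key, value in vae_state.items():
        if key.startswith(SKIPPED_PREFIXES):
            continue  # not part of the encoder/decoder itself
        target = VAE_PREFIX + key
        if target not in merged:
            raise KeyError(f"model has no VAE weight named {target!r}")
        merged[target] = value
    return merged
```

Since only the autoencoder weights change, the rest of the pipeline (text encoder, denoising network, sampler settings) can stay as it is.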