CoreML Stable Diffusion 2.1 Base
| Property | Value |
|---|---|
| License | CreativeML OpenRAIL-M |
| Original Paper | High-Resolution Image Synthesis With Latent Diffusion Models (CVPR 2022) |
| Platform Support | Apple Silicon devices |
| Primary Use | Text-to-image generation |
What is coreml-stable-diffusion-2-1-base?
This is a Core ML-optimized version of the Stable Diffusion 2.1 base model, converted for efficient execution on Apple Silicon devices. It was fine-tuned from the stable-diffusion-2-base model for 220k additional training steps with a relaxed dataset safety filter (punsafe=0.98). The model ships in two variants: a split_einsum version compatible with all compute units, including the Neural Engine, and an original version intended for CPU and GPU execution.
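For orientation, here is a minimal Swift sketch (not from the model card) of how the two variants map onto Core ML compute units via MLModelConfiguration; the model file path is a hypothetical placeholder.

```swift
import CoreML
import Foundation

// split_einsum variant: attention is chunked so the Apple Neural Engine can be used.
let aneConfig = MLModelConfiguration()
aneConfig.computeUnits = .all                 // CPU + GPU + Neural Engine

// original variant: standard attention, intended for CPU and GPU only.
let gpuConfig = MLModelConfiguration()
gpuConfig.computeUnits = .cpuAndGPU

// A compiled Core ML model (e.g. the UNet) is then loaded with the configuration
// that matches the variant in use; the path below is a placeholder.
// let unet = try MLModel(
//     contentsOf: URL(fileURLWithPath: "/path/to/Unet.mlmodelc"),
//     configuration: aneConfig)
```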
Implementation Details
The model uses a Latent Diffusion architecture with a fixed, pretrained OpenCLIP-ViT/H text encoder. Images pass through an autoencoder with a relative downsampling factor of f = 8, mapping inputs of shape H x W x 3 to latents of shape H/f x W/f x 4. The model also integrates with applications such as Mochi Diffusion for practical image generation tasks.
- Optimized performance on Apple Silicon through Core ML conversion
- Trained with a punsafe=0.98 dataset safety-filter threshold
- Compatible with Neural Engine through split_einsum implementation
- Supports 512x512 resolution image generation (see the latent-shape sketch after this list)
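As a quick check on the numbers above, this small Swift sketch computes the latent shape implied by the f = 8 autoencoder for the default 512x512 resolution; the variable names are illustrative only.

```swift
// Latent shape implied by the f = 8 autoencoder for a 512x512 input image.
let f = 8
let (height, width) = (512, 512)                 // input image is H x W x 3
let latentShape = (height / f, width / f, 4)     // H/f x W/f x 4
print("512x512x3 image -> latents \(latentShape)")   // prints "(64, 64, 4)"
```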
Core Capabilities
- High-quality text-to-image generation
- Efficient processing on Apple devices
- Research and artistic applications
- Educational and creative tool integration
Frequently Asked Questions
Q: What makes this model unique?
A: Its value lies in its Core ML optimization for Apple Silicon. It ships both split_einsum and original variants to match different compute units, while preserving the generation quality of Stable Diffusion 2.1.
Q: What are the recommended use cases?
A: The model is recommended for research, artistic creation, educational tools, and design applications. It is particularly suitable for users on Apple Silicon devices who need efficient, local text-to-image generation.
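As an illustration of local generation, the sketch below assumes Apple's open-source ml-stable-diffusion Swift package (StableDiffusionPipeline); the resource path and prompt are placeholders, and argument labels may differ between package versions, so treat it as a starting point rather than a drop-in snippet.

```swift
import CoreGraphics
import CoreML
import Foundation
import StableDiffusion   // Apple's ml-stable-diffusion Swift package (assumed dependency)

// Sketch: generate an image locally on Apple Silicon using the split_einsum variant.
func generateLocally() throws -> [CGImage?] {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine    // matches the split_einsum variant

    let pipeline = try StableDiffusionPipeline(
        resourcesAt: URL(fileURLWithPath: "/path/to/coreml-stable-diffusion-2-1-base"),  // placeholder
        configuration: config,
        reduceMemory: true)
    try pipeline.loadResources()

    var params = StableDiffusionPipeline.Configuration(
        prompt: "a watercolor painting of a lighthouse at sunrise")   // illustrative prompt
    params.stepCount = 25
    params.seed = 93
    params.imageCount = 1

    return try pipeline.generateImages(configuration: params)
}
```

Applications such as Mochi Diffusion wrap this kind of pipeline behind a graphical interface, so the same compiled model files can be used either way.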