Lumina-Next-SFT

Alpha-VLLM

Next-DiT model (2B parameters) that uses a Gemma-2B text encoder for text-to-image generation. It is supervised fine-tuned and uses the SDXL VAE to encode and decode images.

Property	Value
Parameters	2B
License	Apache-2.0
Paper	Link
Resolution	Up to 2K

What is Lumina-Next-SFT?

Lumina-Next-SFT is a text-to-image generation model that pairs the Next-DiT architecture with Google's Gemma-2B language model as its text encoder. It is a supervised fine-tuned (SFT) model and uses stabilityai's fine-tuned SDXL VAE to convert between images and latents.

Implementation Details

The model architecture consists of three main components: a Next-DiT backbone that generates images in latent space, Gemma-2B for text encoding, and an SDXL VAE that converts between pixels and latents. It uses Rectified Flow as its prediction objective and supports a range of resolutions up to 2K.
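As a rough illustration of the Rectified Flow idea mentioned above, the sketch below shows a linear path between noise and data, the velocity target along that path, and a plain Euler sampler. It assumes the convention that t=0 is noise and t=1 is data, which this page does not confirm. The names `interpolate`, `target_velocity`, `euler_sample`, and `make_toy_velocity` are hypothetical, and the toy velocity function is only a stand-in for the Next-DiT network, not the model's actual code.

```python
def interpolate(x0, x1, t):
    """Point on the straight line between noise x0 (t=0) and data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity along the linear path: d x_t / dt = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1 inside the loop
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a trained network: always points at one target."""
    def velocity_fn(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, target)]
    return velocity_fn


if __name__ == "__main__":
    noise = [0.3, 0.1, -0.7]
    target = [1.0, -2.0, 0.5]
    sample = euler_sample(make_toy_velocity(target), noise, num_steps=10)
    print(sample)  # lands on the target up to floating-point rounding
```

In a real pipeline the velocity function would be the text-conditioned network, and the number of Euler steps corresponds to the configurable sampling steps.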

Flexible resolution support (1024x1024, 512x2048, 2048x512, and more)
Configurable sampling steps (1-1000)
Advanced transport options including Linear, GVP, and VP paths
Time-aware scaling method with adjustable parameters
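The options listed above can be sketched as a small settings object with range checks. This is a minimal sketch rather than the project's actual interface: `SamplingConfig` and `token_count` are hypothetical names, the 8x spatial reduction is the usual SDXL VAE factor, and the patch size of 2 is an assumption not stated on this page. The page also does not say whether resolutions are given as width x height or height x width, so the field order here is a guess.

```python
from dataclasses import dataclass

VAE_DOWNSAMPLE = 8  # spatial reduction of the SDXL VAE
PATCH_SIZE = 2      # assumed patch size; not stated in the text above
PATHS = ("Linear", "GVP", "VP")  # transport paths named in the list above


@dataclass(frozen=True)
class SamplingConfig:
    width: int
    height: int
    steps: int
    path: str = "Linear"

    def validate(self):
        """Raise ValueError if a setting falls outside the documented ranges."""
        if not 1 <= self.steps <= 1000:
            raise ValueError("steps must be between 1 and 1000")
        if self.path not in PATHS:
            raise ValueError(f"path must be one of {PATHS}")
        multiple = VAE_DOWNSAMPLE * PATCH_SIZE
        if self.width % multiple or self.height % multiple:
            raise ValueError(f"width and height must be multiples of {multiple}")
        return self


def token_count(cfg):
    """Number of image tokens the backbone would see under the assumptions above."""
    latent_w = cfg.width // VAE_DOWNSAMPLE
    latent_h = cfg.height // VAE_DOWNSAMPLE
    return (latent_w // PATCH_SIZE) * (latent_h // PATCH_SIZE)


if __name__ == "__main__":
    for w, h in [(1024, 1024), (512, 2048), (2048, 512)]:
        cfg = SamplingConfig(width=w, height=h, steps=30).validate()
        print(w, h, token_count(cfg))
```

The three listed resolutions cover the same number of pixels (1024 x 1024 = 512 x 2048 = 1,048,576), so under these assumptions they all yield the same token count.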

Core Capabilities

High-quality image generation at multiple resolutions
Efficient memory usage and fast generation times
Sophisticated text understanding through Gemma-2B integration
Customizable inference settings for different image styles

Frequently Asked Questions

Q: What makes this model unique?

The model combines the Next-DiT architecture with a Gemma-2B text encoder and supervised fine-tuning. Its ability to handle multiple resolutions and its optimized memory usage set it apart from similar models.

Q: What are the recommended use cases?

The model is suited to high-quality image generation tasks, particularly where the image must follow the text prompt closely. It handles both standard 1024x1024 images and other aspect ratios, such as 512x2048 and 2048x512, up to 2K resolution.