stepvideo-t2v

stepfun-ai

A state-of-the-art text-to-video model with 30B parameters that generates videos of up to 204 frames, using a VAE with 16x16 spatial and 8x temporal compression.

Property	Value
Parameter Count	30 Billion
Model Type	Text-to-Video Generation
Architecture	DiT with 3D Full Attention
Paper	arXiv:2502.10248
Maximum Resolution	544px x 992px, up to 204 frames

What is stepvideo-t2v?

StepVideo-T2V is a text-to-video generation model from stepfun-ai. With 30 billion parameters, it generates videos up to 204 frames long from prompts in either English or Chinese. Its deep compression VAE shrinks the video by 16x16 spatially and 8x temporally, so the generator works on a much smaller latent representation than the raw pixels.
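To make the compression ratios concrete, here is a minimal sketch of the arithmetic for the maximum output size listed above. The function names are hypothetical, and rounding the frame count up is an assumption: 204 is not a multiple of 8, and the text does not say how the VAE groups frames, so the real latent frame count may differ.

```python
import math

def latent_shape(height, width, frames, spatial=16, temporal=8):
    """Nominal latent grid (frames, height, width) after compression.

    Assumption: partial temporal groups are rounded up. The actual model
    may chunk frames differently.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    return (math.ceil(frames / temporal), height // spatial, width // spatial)

def nominal_reduction(spatial=16, temporal=8):
    """Factor by which the number of spatio-temporal positions shrinks."""
    return spatial * spatial * temporal

print(latent_shape(544, 992, 204))  # (26, 34, 62)
print(nominal_reduction())          # 2048
```

The 2048x figure counts positions only. It ignores the number of channels stored per latent position, which is not stated here.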

Implementation Details

The model is built on three main components: a Video-VAE for efficient compression, a DiT with 3D full attention, and a video-based Direct Preference Optimization (Video-DPO) stage. The DiT has 48 layers, each with 48 attention heads, and uses AdaLN-Single for timestep conditioning and QK-Norm for training stability.

Deep compression VAE with 16x16 spatial and 8x temporal compression
Dual bilingual text encoders for English and Chinese support
3D RoPE implementation for handling variable video lengths
Video-DPO for enhanced visual quality and human preference alignment
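As an illustration of the QK-Norm idea mentioned above, the following is a minimal single-head sketch in plain Python. It assumes RMS normalization of queries and keys, which is one common form of QK-Norm. The text does not specify which normalization the model uses, and all function names here are hypothetical.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Rescale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_attention(queries, keys, values):
    """Single-head attention with normalized queries and keys.

    Assumption: RMS normalization is applied to each query and key vector
    before the dot product. This bounds the logits regardless of how large
    the raw activations grow, which is the stability benefit of QK-Norm.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    q_n = [rms_norm(q) for q in queries]
    k_n = [rms_norm(k) for k in keys]
    out = []
    for q in q_n:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_n]
        weights = softmax(logits)
        # Weighted sum of value vectors, one output vector per query.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

q = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
k = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.0], [3.0, 1.0, -1.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = qk_norm_attention(q, k, v)
```

Because the queries are normalized, multiplying them by a large constant leaves the attention output essentially unchanged, whereas in unnormalized attention the same scaling would sharpen the softmax toward a one-hot distribution.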

Core Capabilities

Generate videos up to 204 frames long
Support for high-resolution output (544px x 992px)
Bilingual prompt processing
Efficient inference with optional flash-attention support
Customizable generation parameters for quality-speed tradeoff

Frequently Asked Questions

Q: What makes this model unique?

The model combines high compression ratios (16x16 spatial, 8x temporal), a large parameter count (30B), and video-specific DPO training, which together make it effective for high-quality video generation. It can generate sequences of up to 204 frames while maintaining quality.

Q: What are the recommended use cases?

The model excels in generating videos across various categories including sports, food, scenery, animals, festivals, surreal concepts, and cinematography. It's particularly suitable for applications requiring high-quality video generation from textual descriptions in either English or Chinese.