Wan2.1-T2V-1.3B-Diffusers

Property	Value
Parameter Count	1.3B
Model Type	Text-to-Video Diffusion
License	Apache 2.0
GPU Memory Required	8.19GB VRAM
Supported Resolution	480P (Optimal)

What is Wan2.1-T2V-1.3B-Diffusers?

Wan2.1-T2V-1.3B-Diffusers is a text-to-video generation model built to run on consumer-grade GPUs while producing video quality comparable to some closed-source solutions. It can generate a 5-second 480P video in approximately 4 minutes on an RTX 4090.

Implementation Details

The model uses a Flow Matching framework within the Diffusion Transformer paradigm. It employs a T5 Encoder to encode multilingual text input and a spatio-temporal variational autoencoder (Wan-VAE) for efficient video processing. The transformer's hyperparameters are:

Dimension: 1536
Input/Output Dimension: 16
Feedforward Dimension: 8960
Number of Layers: 30
Number of Heads: 12
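
As a rough sanity check on the figures above, the following sketch derives the per-head dimension and a back-of-the-envelope weight count. The class and function names are hypothetical, and the block layout (self-attention, cross-attention, and a two-layer feedforward per block) is an assumption made for the estimate, not something this card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanDiTConfig:
    """Hyperparameters as listed on this card (class name is hypothetical)."""
    dim: int = 1536
    in_out_dim: int = 16
    ffn_dim: int = 8960
    num_layers: int = 30
    num_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Each head attends over an equal slice of the hidden dimension.
        return self.dim // self.num_heads


def estimate_block_params(cfg: WanDiTConfig) -> int:
    """Weight count for one block, ASSUMING self-attention, cross-attention
    and a two-layer feedforward; biases, norms and modulation are ignored."""
    self_attn = 4 * cfg.dim * cfg.dim   # Q, K, V and output projections
    cross_attn = 4 * cfg.dim * cfg.dim  # same shapes, attending to text
    ffn = 2 * cfg.dim * cfg.ffn_dim     # up- and down-projection
    return self_attn + cross_attn + ffn


def estimate_total_params(cfg: WanDiTConfig) -> int:
    # All blocks are assumed to have identical shapes.
    return cfg.num_layers * estimate_block_params(cfg)


cfg = WanDiTConfig()
print(f"head dim: {cfg.head_dim}")
print(f"approx. transformer weights: {estimate_total_params(cfg) / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 1.39 billion weights with a head dimension of 128, the same order as the 1.3B label. Embeddings, normalization layers, and the separate T5 encoder and Wan-VAE are not counted.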

Core Capabilities

Text-to-Video generation with optimal 480P resolution
Visual text generation in Chinese and English (rendering legible text inside the generated video)
Low memory footprint: 8.19GB VRAM is enough for inference
Support for prompt extension through Dashscope API or local models
Compatible with Diffusers pipeline and various inference methods
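
To illustrate the Diffusers compatibility listed above, here is a minimal sketch of a text-to-video call. It assumes a recent `diffusers` release that ships `WanPipeline` and `AutoencoderKLWan`, a CUDA GPU, and the Hugging Face repository ID `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. The `T2VSettings` class, the `generate_video` helper, and all default values (832x480, 81 frames, 16 fps, guidance scale 5.0) are illustrative assumptions, not values taken from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T2VSettings:
    """Illustrative defaults; every value here is an assumption."""
    model_id: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
    height: int = 480            # 480P, the resolution the card calls optimal
    width: int = 832             # assumed widescreen width for 480P
    num_frames: int = 81         # assumed: about 5 seconds at 16 fps
    fps: int = 16                # assumed playback rate for the exported file
    guidance_scale: float = 5.0  # assumed classifier-free guidance strength


def generate_video(prompt: str, out_path: str = "output.mp4",
                   settings: T2VSettings = T2VSettings()) -> str:
    """Generate one clip from `prompt` and write it to `out_path`."""
    # Heavy imports are deferred so this module loads without torch installed.
    import torch
    from diffusers import AutoencoderKLWan, WanPipeline
    from diffusers.utils import export_to_video

    # Load the VAE in float32 and the rest of the pipeline in bfloat16.
    vae = AutoencoderKLWan.from_pretrained(
        settings.model_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipe = WanPipeline.from_pretrained(
        settings.model_id, vae=vae, torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    # The pipeline returns a batch of videos; take the first one.
    frames = pipe(
        prompt=prompt,
        height=settings.height,
        width=settings.width,
        num_frames=settings.num_frames,
        guidance_scale=settings.guidance_scale,
    ).frames[0]

    export_to_video(frames, out_path, fps=settings.fps)
    return out_path
```

Calling `generate_video("A cat walking on grass, realistic style")` would download the weights on first use and write `output.mp4`. Keeping the settings in a frozen dataclass makes it easy to try other frame counts or resolutions without touching the loading code.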

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for running on consumer GPUs while maintaining high-quality output. It requires only 8.19GB of VRAM, which puts it within reach of most consumer graphics cards, while delivering performance comparable to larger models.

Q: What are the recommended use cases?

The model is best suited to generating short videos from text descriptions, particularly at 480P resolution. It fits creative teams that need quick video generation without extensive computational resources.