# Wan2.1-T2V-1.3B-Diffusers
| Property | Value |
|---|---|
| Parameter Count | 1.3B |
| Model Type | Text-to-Video Diffusion |
| License | Apache 2.0 |
| GPU Memory Required | 8.19 GB VRAM |
| Supported Resolution | 480P (optimal) |
## What is Wan2.1-T2V-1.3B-Diffusers?
Wan2.1-T2V-1.3B-Diffusers is a compact text-to-video generation model that pairs efficiency with strong output quality. It is designed to run on consumer-grade GPUs while delivering video quality comparable to some closed-source solutions, and it can generate a 5-second 480P video in approximately 4 minutes on an RTX 4090.
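As a quick start, the sketch below shows one way to run the model with the Diffusers library. It is a minimal sketch, assuming the Hugging Face model id `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` and a recent `diffusers` release that ships the `WanPipeline` and `AutoencoderKLWan` classes:

```python
# Minimal text-to-video sketch, assuming a recent diffusers release
# with WanPipeline and AutoencoderKLWan, and the model id below.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# Load the Wan-VAE in float32 for decoding stability; the transformer
# runs in bfloat16 to keep VRAM usage low.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# 81 frames at 16 fps is roughly a 5-second clip at 480P (832x480).
frames = pipe(
    prompt="A cat walks on the grass, realistic style",
    negative_prompt="blurry, low quality, distorted",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=16)
```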
## Implementation Details
The model uses a Flow Matching framework within the Diffusion Transformer paradigm, with a hidden dimension of 1536, 30 layers, and 12 attention heads. A T5 encoder handles multilingual text conditioning, and a novel spatio-temporal variational autoencoder (Wan-VAE) provides efficient video encoding and decoding. Key architecture hyperparameters:
- Dimension: 1536
- Input/Output Dimension: 16
- Feedforward Dimension: 8960
- Number of Layers: 30
- Number of Heads: 12
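These numbers can be checked against the published weights. A small sketch, assuming the `WanTransformer3DModel` class in `diffusers`, loads just the transformer subfolder and prints its configuration:

```python
# Sketch: inspect the transformer's hyperparameters without loading
# the full pipeline. Assumes WanTransformer3DModel exists in diffusers.
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# The config holds the architecture hyperparameters (layer count,
# attention heads, feedforward dimension, and so on).
print(transformer.config)
```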
## Core Capabilities
- Text-to-Video generation with optimal 480P resolution
- Multilingual text generation support (Chinese and English)
- Efficient video processing with minimal VRAM requirements (see the memory-saving sketch after this list)
- Support for prompt extension through the Dashscope API or local models (a local-model sketch also follows this list)
- Compatible with the Diffusers pipeline and various inference methods
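On GPUs with less headroom than the quoted 8.19 GB, Diffusers' standard offloading hooks can cut peak VRAM at some cost in speed. A minimal sketch, assuming the same pipeline classes as in the quick-start example:

```python
# Memory-saving sketch using standard diffusers offloading hooks.
import torch
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Use this instead of pipe.to("cuda"): each sub-model is moved to the
# GPU only while it runs, lowering peak VRAM at the cost of speed.
pipe.enable_model_cpu_offload()

# Decode the video in tiles to reduce peak memory during VAE decoding.
pipe.vae.enable_tiling()
```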
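For prompt extension without the Dashscope API, the official repository supports driving a local LLM to enrich short prompts before generation. The sketch below only approximates that local route with a generic instruction-tuned model via `transformers`; the model id and system instruction are illustrative assumptions, not the repository's exact setup:

```python
# Hypothetical local prompt-extension sketch; the official Wan repo
# ships its own prompt-extension utilities, this only mimics the idea.
from transformers import pipeline

expander = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")

messages = [
    # Illustrative system instruction, not the official template.
    {"role": "system", "content": "Expand the user's video prompt with rich, concrete visual detail."},
    {"role": "user", "content": "A cat walks on the grass"},
]

# The chat-aware pipeline returns the conversation with the assistant's
# reply appended as the last message.
result = expander(messages, max_new_tokens=128)
extended_prompt = result[0]["generated_text"][-1]["content"]
print(extended_prompt)
```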
## Frequently Asked Questions
### Q: What makes this model unique?
The model's ability to run on consumer GPUs while maintaining high-quality output sets it apart. It requires only 8.19 GB of VRAM, making it accessible to most users while delivering performance comparable to larger models.
### Q: What are the recommended use cases?
The model excels at generating short-form videos from text descriptions, particularly at 480P resolution. It is well suited to creative teams that need quick video generation without extensive computational resources.