Ruyi-Mini-7B
Property | Value |
---|---|
Parameter Count | 7.1 Billion |
Model Type | Image-to-Video Generation |
License | Apache 2.0 |
Author | IamCreateAI |
Model URL | https://huggingface.co/IamCreateAI/Ruyi-Mini-7B |
What is Ruyi-Mini-7B?
Ruyi-Mini-7B is an open-source image-to-video generation model that turns a static image into a short video clip. With approximately 7.1 billion parameters, it generates video at resolutions from 360p to 720p, in multiple aspect ratios, and at durations of up to 5 seconds. It also exposes motion and camera controls, so creators can steer how the scene and the viewpoint move.
Implementation Details
The model architecture consists of three primary components: a Causal VAE module for video compression, a Diffusion Transformer module with 3D full attention, and a CLIP model for semantic feature extraction. Training proceeded in four phases, which include pre-training on 200M video clips, multi-scale resolution fine-tuning, and specialized image-to-video training.
- Causal VAE module reduces spatial resolution to 1/8 and temporal resolution to 1/4
- 2D Normalized-RoPE for spatial dimensions
- Sin-cos position embedding for temporal dimensions
- DDPM (denoising diffusion probabilistic model) objective for model training
- CLIP-guided video generation through cross-attention
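To make the compression figures above concrete, here is a minimal sketch of the latent grid size implied by a 1/8 spatial and 1/4 temporal reduction. The function names are hypothetical, the ceiling division is an assumption about how non-divisible sizes are handled, and the 120-frame clip length used in the note below is an assumed input rather than a figure from this page. Causal VAEs often encode the first frame separately, so the model's actual temporal latent length may differ slightly from this simple division.

```python
from math import ceil

SPATIAL_FACTOR = 8   # from the text: spatial resolution reduced to 1/8
TEMPORAL_FACTOR = 4  # from the text: temporal resolution reduced to 1/4


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: sizes that do not divide evenly are rounded up.
    """
    return (
        ceil(frames / TEMPORAL_FACTOR),
        ceil(height / SPATIAL_FACTOR),
        ceil(width / SPATIAL_FACTOR),
    )


def compression_ratio(frames: int, height: int, width: int) -> float:
    """Ratio of pixel-space positions to latent-space positions (channels ignored)."""
    t, h, w = latent_shape(frames, height, width)
    return (frames * height * width) / (t * h * w)


if __name__ == "__main__":
    # 120 frames is an assumed clip length used only for illustration.
    print(latent_shape(120, 720, 1280))
    print(compression_ratio(120, 720, 1280))
```

When every dimension divides evenly, the number of positions the transformer attends over shrinks by 4 x 8 x 8 = 256, which is why 3D full attention is feasible at these resolutions.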
Core Capabilities
- Supports resolutions from 360p to 720p
- Maximum video duration of 5 seconds
- Multiple aspect ratio support
- Motion and camera control features
- VRAM requirements from 21.5GB to 54.8GB, depending on video size and resolution
Frequently Asked Questions
Q: What makes this model unique?
Ruyi-Mini-7B stands out for its multi-phase training approach and its ability to generate video at several resolutions while maintaining quality. Its motion and camera control features give creators direct control over how the generated video moves.
Q: What are the recommended use cases?
The model is well suited to creating short-form video from static images, particularly in creative applications, content creation, and prototyping. Users should note its limitations with text rendering, hand representations, and crowded human faces.
Q: What are the hardware requirements?
Requirements vary by video size and resolution. For example, 360x480 videos need 21.5GB of VRAM, while 720x1280 videos require 54.8GB. A low-memory mode is available for 24GB cards such as the RTX 4090.
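As a rough illustration of how these figures might guide a pre-flight check, the sketch below looks up the two quoted requirements and suggests whether to run normally or try the low-memory mode. The table contains only the two sizes quoted above, the function name and return labels are hypothetical, and nothing here calls the model's actual code. Whether low-memory mode fits a given size on a given card is not stated in the source, so the helper only suggests trying it.

```python
# VRAM figures quoted in the text, keyed by the size exactly as written there.
QUOTED_VRAM_GB = {
    (360, 480): 21.5,
    (720, 1280): 54.8,
}


def suggest_mode(size: tuple[int, int], available_gb: float) -> str:
    """Suggest a run mode for a video size given available VRAM in GB.

    Returns one of three hypothetical labels:
      "standard"       - the quoted requirement fits in available VRAM
      "try-low-memory" - the quoted requirement exceeds available VRAM
      "unknown"        - the size is not one of the quoted examples
    """
    required = QUOTED_VRAM_GB.get(size)
    if required is None:
        return "unknown"
    if available_gb >= required:
        return "standard"
    return "try-low-memory"


if __name__ == "__main__":
    for size in QUOTED_VRAM_GB:
        print(size, suggest_mode(size, available_gb=24.0))
```

On a 24GB card this reports that the 360x480 case fits as quoted, while the 720x1280 case would need the low-memory mode or a larger GPU.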