Mochi-1 Preview
Property | Value |
---|---|
Parameter Count | 10 Billion |
Model Type | Text-to-Video Generation |
License | Apache 2.0 |
Architecture | Asymmetric Diffusion Transformer (AsymmDiT) |
VRAM Requirements | 60GB (Single GPU) |
What is mochi-1-preview?
Mochi-1 Preview is an open-source text-to-video generation model developed by Genmo. At 10 billion parameters, it is presented as the largest openly released video generation model, and it is built on a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. The model produces high-fidelity motion and adheres closely to input prompts, narrowing the gap between closed and open video generation systems.
Implementation Details
The AsymmDiT has 48 layers and 24 attention heads. It processes visual and text tokens together, giving the visual stream a wider hidden dimension (3072) than the text stream (1536). Prompts are encoded with a single T5-XXL language model, and a companion AsymmVAE compresses video to a representation 128x smaller than the original.
- Visual Processing: 44,520 tokens with 3072-dimensional representation
- Text Processing: 256 tokens with 1536-dimensional representation
- Video Compression: 8x8 spatial and 6x temporal reduction
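The figures above can be related to each other with a little arithmetic. The sketch below is a rough reconstruction, not the model's actual code: it assumes an 848x480 clip of 163 frames, a 2x2 patching step on the latents, and a causal temporal scheme in which the first frame is kept and every following group of 6 frames becomes one latent frame. None of these values are stated in this document; they are assumptions chosen because they reproduce the 44,520 visual tokens listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    width: int   # pixels
    height: int  # pixels
    frames: int  # number of video frames


def visual_token_count(
    shape: VideoShape,
    spatial: int = 8,   # 8x8 spatial reduction (from the list above)
    temporal: int = 6,  # 6x temporal reduction (from the list above)
    patch: int = 2,     # ASSUMPTION: 2x2 latent patches per token
) -> int:
    """Estimate how many visual tokens the transformer sees for one clip."""
    latent_w = shape.width // spatial
    latent_h = shape.height // spatial
    # ASSUMPTION: causal compression keeps the first frame on its own.
    latent_t = (shape.frames - 1) // temporal + 1
    tokens_per_latent_frame = (latent_w // patch) * (latent_h // patch)
    return latent_t * tokens_per_latent_frame


# ASSUMPTION: 848x480 at 163 frames is used here as the reference clip.
tokens = visual_token_count(VideoShape(width=848, height=480, frames=163))
print(tokens)  # 44520
```

Under these assumptions the clip becomes 28 latent frames of 53x30 tokens each, which is where the sequence length comes from; a shorter clip shrinks the token count roughly in proportion to its frame count.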
Core Capabilities
- High-quality video generation at 480p resolution
- Strong prompt adherence and realistic motion synthesis
- Efficient context-parallel implementation
- Support for both multi-GPU and single-GPU operation
- Integration with popular frameworks like Diffusers and ComfyUI
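As an illustration of the Diffusers integration, the following is a minimal sketch that assumes a recent `diffusers` release shipping `MochiPipeline`, the `genmo/mochi-1-preview` checkpoint with a `bf16` variant, and enough GPU memory once CPU offload and VAE tiling are enabled. `snap_num_frames` and `generate_video` are hypothetical helpers written for this example, not part of any library; the first simply rounds a requested length down to a count of the form 6k + 1, on the assumption that this lines up with the 6x temporal reduction.

```python
def snap_num_frames(requested: int, temporal_factor: int = 6) -> int:
    """Round down to the nearest frame count of the form temporal_factor * k + 1."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return ((requested - 1) // temporal_factor) * temporal_factor + 1


def generate_video(
    prompt: str,
    out_path: str = "mochi.mp4",
    num_frames: int = 85,
    fps: int = 30,
) -> str:
    """Generate a clip from a text prompt and write it to out_path."""
    # Heavy imports stay local so snap_num_frames works without a GPU stack.
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
    )
    # Memory savers: move idle submodules to CPU and decode the video in tiles.
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    frames = pipe(prompt, num_frames=snap_num_frames(num_frames)).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate_video("A close-up of a chameleon on a branch, natural light")` would download the weights on first use and write `mochi.mp4`; actual memory use and runtime depend on the hardware and the frame count chosen.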
Frequently Asked Questions
Q: What makes this model unique?
Its asymmetric architecture and scale (10B parameters) set it apart, along with its ability to maintain high-fidelity motion while closely following text prompts. It is also the largest openly released video generation model and is distributed under the permissive Apache 2.0 license.
Q: What are the recommended use cases?
The model excels at generating photorealistic videos from text descriptions. It's particularly suited for creating high-quality motion content, though it's not optimized for animated or cartoon-style content. Users should be aware of the 480p resolution limitation and potential minor warping in cases of extreme motion.