mochi-1-preview

Maintained By
genmo

Mochi-1 Preview

Property             Value
Parameter Count      10 billion
Model Type           Text-to-Video Generation
License              Apache 2.0
Architecture         Asymmetric Diffusion Transformer (AsymmDiT)
VRAM Requirements    60GB (single GPU)

What is mochi-1-preview?

Mochi-1 Preview is an open-source text-to-video generation model developed by Genmo. With 10 billion parameters, Genmo describes it as the largest openly released video generation model at the time of its release. It is built on a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. The model produces high-fidelity motion and adheres closely to input prompts, which narrows the gap between closed and open video generation systems.

Implementation Details

The AsymmDiT has 48 layers and 24 attention heads, and it processes both visual tokens (3072-dimensional) and text tokens (1536-dimensional). Prompts are encoded with a single T5-XXL language model. Video is compressed by an AsymmVAE, which reduces it to roughly 1/128 of its original size.

  • Visual Processing: 44,520 tokens with 3072-dimensional representation
  • Text Processing: 256 tokens with 1536-dimensional representation
  • Video Compression: 8x8 spatial and 6x temporal reduction
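
The token count and compression factors above can be tied together with a little arithmetic. The sketch below is illustrative only: the 848x480 resolution, the 163-frame clip length, the 2x2 patch size, and the "first frame plus one latent per 6 frames" rule are assumptions not stated on this page. They are chosen because, combined with the 8x8 spatial and 6x temporal reduction, they reproduce the 44,520 visual tokens listed above. The function names are hypothetical.

```python
# Assumed inputs (not stated in this document): 163 frames at 848x480,
# and a 2x2 patch size when turning latents into tokens.
SPATIAL_FACTOR = 8    # 8x8 spatial reduction (from the list above)
TEMPORAL_FACTOR = 6   # 6x temporal reduction (from the list above)
VISUAL_DIM = 3072     # visual token width (from the list above)
NUM_HEADS = 24        # attention heads (from the text above)


def latent_shape(frames, height, width):
    """Return (latent_frames, latent_height, latent_width).

    Assumes the first frame is kept and every further group of
    TEMPORAL_FACTOR frames becomes one latent frame.
    """
    latent_frames = (frames - 1) // TEMPORAL_FACTOR + 1
    return latent_frames, height // SPATIAL_FACTOR, width // SPATIAL_FACTOR


def visual_tokens(frames, height, width, patch=2):
    """Count visual tokens, assuming square patches of side `patch`."""
    t, h, w = latent_shape(frames, height, width)
    return t * (h // patch) * (w // patch)


print(latent_shape(163, 480, 848))   # (28, 60, 106)
print(visual_tokens(163, 480, 848))  # 44520
print(VISUAL_DIM // NUM_HEADS)       # 128 dimensions per visual attention head
```

Under these assumptions, a 163-frame clip becomes 28 latent frames of 60x106, and 2x2 patching yields 28 x 30 x 53 = 44,520 tokens.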

Core Capabilities

  • High-quality video generation at 480p resolution
  • Strong prompt adherence and realistic motion synthesis
  • Efficient context-parallel implementation
  • Support for both multi-GPU and single-GPU operation
  • Integration with popular frameworks like Diffusers and ComfyUI
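
As a rough illustration of the Diffusers integration mentioned above, the following sketch wraps a text-to-video call in a helper. It assumes a recent `diffusers` release that ships `MochiPipeline`, a CUDA GPU, and the `genmo/mochi-1-preview` checkpoint on the Hugging Face Hub. The function name `generate_clip`, the default output path, and the default frame count are illustrative choices, not values taken from this page. CPU offload and VAE tiling are enabled here as memory-saving options, so actual memory use will differ from the single-GPU figure in the table.

```python
def generate_clip(prompt, output_path="mochi.mp4", num_frames=84, fps=30):
    """Generate a short clip with Mochi-1 Preview via Diffusers (sketch).

    Heavy imports are deferred so this module can be loaded on machines
    that do not have torch or diffusers installed.
    """
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    # Assumed checkpoint ID and bf16 variant on the Hugging Face Hub.
    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview",
        variant="bf16",
        torch_dtype=torch.bfloat16,
    )

    # Memory-saving options: move idle submodules to CPU and decode the
    # video latents in tiles.
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    # The pipeline returns a batch of videos; take the first one.
    frames = pipe(prompt, num_frames=num_frames).frames[0]
    export_to_video(frames, output_path, fps=fps)
    return output_path
```

A call such as `generate_clip("A close-up of waves breaking on a rocky shore at sunset")` would download the weights on first use and write the result to `mochi.mp4`. Expect the download and generation to take considerable time and disk space.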

Frequently Asked Questions

Q: What makes this model unique?

Two things set the model apart: its asymmetric architecture and its scale (10 billion parameters). Together they let it maintain high-fidelity motion while closely following text prompts. It is also the largest openly released video generation model, and it ships under the permissive Apache 2.0 license.

Q: What are the recommended use cases?

The model excels at generating photorealistic videos from text descriptions. It's particularly suited for creating high-quality motion content, though it's not optimized for animated or cartoon-style content. Users should be aware of the 480p resolution limitation and potential minor warping in cases of extreme motion.
