CogVideoX1.5-5B

THUDM

CogVideoX1.5-5B is a 5B-parameter text-to-video generation model that produces high-resolution (1360x768) video at 16 fps from English prompts.

Property	Value
Model Type	Text-to-Video Generation
License	Custom CogVideoX License
Paper	arXiv:2408.06072
Resolution	1360 x 768
Frame Rate	16 FPS

What is CogVideoX1.5-5B?

CogVideoX1.5-5B is a text-to-video generation model developed by THUDM. It generates video from text descriptions, producing clips of 5 or 10 seconds at 16 frames per second. The model is intended to run in BF16 precision and requires at least 9 GB of GPU memory for single-GPU inference.
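The duration and frame-rate figures above determine how many frames a generation call has to request. The following is a minimal sketch of that arithmetic. It assumes a 16N + 1 frame convention (one initial frame plus 16 frames per second of video), which this summary does not state, and the helper name `frames_for_duration` is hypothetical.

```python
FPS = 16                       # frame rate stated for CogVideoX1.5-5B
SUPPORTED_DURATIONS = (5, 10)  # clip lengths in seconds stated for the model


def frames_for_duration(seconds: int, fps: int = FPS) -> int:
    """Return the frame count for a clip of the given length.

    ASSUMPTION: the sampler expects fps * seconds + 1 frames
    (one initial frame plus fps frames per second).
    """
    if seconds not in SUPPORTED_DURATIONS:
        raise ValueError(f"duration must be one of {SUPPORTED_DURATIONS}, got {seconds}")
    return seconds * fps + 1


if __name__ == "__main__":
    for seconds in SUPPORTED_DURATIONS:
        print(f"{seconds} s -> {frames_for_duration(seconds)} frames")
```

Under that assumption a 5-second clip corresponds to 81 frames and a 10-second clip to 161.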

Implementation Details

The model is deployed through the Hugging Face diffusers library and supports several precision modes: BF16 (recommended), FP16, FP32, FP8, and INT8. Memory use can be reduced with VAE slicing and tiling. Multi-GPU inference is also supported, with 24 GB memory consumption.

Maximum prompt length: 224 tokens
Input language: English
Compatible with NVIDIA Ampere architecture or newer
Inference speed: ~1000 seconds for a 5-second video on an A100
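The pieces described in this section (diffusers, BF16, VAE slicing and tiling) might be combined roughly as follows. This is a sketch rather than the authors' reference script: the Hub id `THUDM/CogVideoX1.5-5B`, the step count, the guidance scale, the seed, and the 81-frame default (5 s at 16 fps plus one frame) are assumptions, and `GenerationConfig` and `generate` are hypothetical names. It also enables sequential CPU offloading, one of the offloading options diffusers provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    model_id: str = "THUDM/CogVideoX1.5-5B"  # ASSUMPTION: Hub repository id
    width: int = 1360                        # resolution stated for the model
    height: int = 768
    fps: int = 16                            # frame rate stated for the model
    num_frames: int = 81                     # ASSUMPTION: 5 s * 16 fps + 1
    num_inference_steps: int = 50            # ASSUMPTION: illustrative value
    guidance_scale: float = 6.0              # ASSUMPTION: illustrative value
    seed: int = 42
    output_path: str = "output.mp4"


def generate(prompt: str, cfg: GenerationConfig = GenerationConfig()) -> str:
    """Generate one video for `prompt` and return the output file path."""
    # Heavy imports are deferred so the config can be inspected
    # on machines without torch or diffusers installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # BF16 is the recommended precision.
    pipe = CogVideoXPipeline.from_pretrained(cfg.model_id, torch_dtype=torch.bfloat16)

    # Memory optimizations: offload submodules to CPU between uses,
    # and decode the VAE output in slices and tiles.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    result = pipe(
        prompt=prompt,  # English, at most 224 tokens
        height=cfg.height,
        width=cfg.width,
        num_frames=cfg.num_frames,
        num_inference_steps=cfg.num_inference_steps,
        guidance_scale=cfg.guidance_scale,
        generator=torch.Generator().manual_seed(cfg.seed),
    )
    export_to_video(result.frames[0], cfg.output_path, fps=cfg.fps)
    return cfg.output_path
```

Calling `generate("A panda playing guitar in a bamboo forest")` would download the weights on first use and write `output.mp4`. Expect a long run, given the inference speed listed above.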

Core Capabilities

High-resolution video generation (1360x768)
Efficient memory management with various optimization options
Support for quantization using TorchAO and Optimum-quanto
Flexible deployment options with CPU offloading capabilities
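A back-of-envelope calculation helps show why precision choice, quantization, and offloading matter for a model of this size. The sketch below is not a measurement: it treats "5B" as exactly 5e9 parameters, counts only the weights (no text encoder, VAE, or activations), and uses decimal gigabytes. The names `BYTES_PER_PARAM` and `weight_gigabytes` are hypothetical.

```python
# Storage cost per parameter for each precision mode listed for the model.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1, "int8": 1}


def weight_gigabytes(num_params: float, precision: str) -> float:
    """Estimate weight storage in decimal GB for a given precision.

    ASSUMPTION: every parameter is stored at the same width, and
    nothing besides the weights is counted.
    """
    try:
        bytes_per_param = BYTES_PER_PARAM[precision.lower()]
    except KeyError:
        raise ValueError(f"unknown precision: {precision!r}") from None
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gigabytes(5e9, precision):.1f} GB")
```

Under these assumptions the BF16 weights alone come to about 10 GB, which is above the 9 GB single-GPU minimum. That suggests the minimum figure relies on offloading rather than keeping all weights on the GPU at once, and that 8-bit quantization roughly halves the weight footprint.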

Frequently Asked Questions

Q: What makes this model unique?

CogVideoX1.5-5B generates high-resolution (1360x768) video while keeping memory requirements moderate, from 9 GB on a single GPU, through optimizations such as VAE slicing and tiling, CPU offloading, and quantization.

Q: What are the recommended use cases?

The model is suited to generating video from text descriptions, particularly for content creation, prototyping, and creative applications.