CogVideoX-2b

THUDM

CogVideoX-2b is an open-source text-to-video diffusion model that generates 720x480 video at 8 fps. With diffusers memory optimizations enabled, FP16 inference runs in as little as 4GB of VRAM.

Property	Value
License	Apache 2.0
Paper	arXiv:2408.06072
Framework	Diffusers
Task	Text-to-Video Generation

What is CogVideoX-2b?

CogVideoX-2b is an entry-level text-to-video generation model designed for efficient video creation with minimal computational requirements. It is the lightweight member of the CogVideoX family and generates 6-second videos at 720x480 resolution and 8 frames per second.
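The clip length and frame rate above fix how many frames a generation produces. A minimal sketch of that arithmetic follows. It assumes a convention not stated on this page: the frame count is duration times frame rate plus one initial frame, which gives the 49-frame value commonly passed as `num_frames` in diffusers examples. The helper name `frame_count` is hypothetical.

```python
# Output resolution quoted above.
WIDTH, HEIGHT = 720, 480


def frame_count(duration_s: int = 6, fps: int = 8, extra_initial_frame: bool = True) -> int:
    """Number of frames for a clip of the given length.

    The optional +1 is an assumed convention (one initial frame in addition
    to duration * fps); it is not stated in the model description itself.
    """
    frames = duration_s * fps
    return frames + 1 if extra_initial_frame else frames
```

With the defaults, `frame_count()` evaluates to 6 * 8 + 1 = 49.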

Implementation Details

The model runs in FP16 precision and needs as little as 4GB of VRAM when used through diffusers with memory optimizations enabled. It uses 3d_sincos_pos_embed positional encoding and also supports BF16, FP32, and INT8 precision.
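To make the effect of precision concrete, here is a rough sketch of how weight storage scales with bytes per parameter. It assumes the "2b" in the model name means roughly 2 billion parameters, and it counts weights only: the text encoder, VAE, activations, and any offloading are ignored, so these numbers are not the VRAM figures quoted on this page. The names `BYTES_PER_PARAM` and `weight_memory_gb` are hypothetical.

```python
# Bytes used to store one parameter in each supported precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}

# Assumption: "2b" is read as a nominal 2 billion parameters.
NOMINAL_PARAMS = 2e9


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def weight_memory_table(num_params: float = NOMINAL_PARAMS) -> dict:
    """Weight storage for every listed precision."""
    return {p: weight_memory_gb(num_params, p) for p in BYTES_PER_PARAM}
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is why FP16 and INT8 matter on small GPUs.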

Inference speed: ~90 seconds on A100, ~45 seconds on H100 (50 steps)
VRAM usage: 18GB with SAT, 4GB with diffusers (FP16)
Supports English prompts up to 226 tokens
Compatible with PyTorch AO (torchao) and Optimum-quanto for quantization
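The figures above can be tied together in a short diffusers sketch. It is an illustration under assumptions, not a reference implementation: the repository id `THUDM/CogVideoX-2b` is inferred from the model and publisher names, `num_frames=49` and `guidance_scale=6.0` are assumed values not given on this page, and the specific combination of offloading and VAE slicing/tiling is one plausible reading of "optimizations enabled". The helper names `generation_settings` and `generate_video` are hypothetical.

```python
MODEL_ID = "THUDM/CogVideoX-2b"  # assumed Hugging Face repo id


def generation_settings() -> dict:
    """Sampling settings based on the figures quoted above."""
    return {
        "num_inference_steps": 50,   # step count used for the timing figures
        "num_frames": 49,            # assumed: 6 s * 8 fps + 1
        "max_sequence_length": 226,  # prompt token limit
        "guidance_scale": 6.0,       # assumed value, not stated on this page
    }


def generate_video(prompt: str, output_path: str = "output.mp4") -> str:
    """Generate one clip and write it to output_path."""
    # Heavy imports stay local so this module imports without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

    # Memory optimizations: slower inference in exchange for lower peak VRAM.
    pipe.enable_sequential_cpu_offload()  # move submodules to the GPU only while they run
    pipe.vae.enable_slicing()             # decode the batch one slice at a time
    pipe.vae.enable_tiling()              # decode frames in spatial tiles

    frames = pipe(prompt=prompt, **generation_settings()).frames[0]
    export_to_video(frames, output_path, fps=8)
    return output_path
```

Calling `generate_video("A panda playing a guitar in a bamboo forest")` would download the weights on first use and write `output.mp4`. Dropping the offload and VAE calls should speed up generation at the cost of higher VRAM use.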

Core Capabilities

High-quality video generation from text descriptions
Efficient memory management with multiple optimization options
Support for various precision formats and quantization methods
Multi-GPU inference support
Fine-tuning capabilities with LoRA and SFT options

Frequently Asked Questions

Q: What makes this model unique?

CogVideoX-2b balances generation quality against resource requirements. Because it can run in as little as 4GB of VRAM with diffusers, it is accessible to users with limited computational resources while still producing good-quality video.

Q: What are the recommended use cases?

The model suits standard text-to-video generation tasks. It is a particularly good fit for development and testing environments, content creation, and scenarios where computational resources are limited but good-quality video generation is still required.