Allegro-TI2V

rhymes-ai

Allegro-TI2V is an open-source text-image-to-video generation model that creates 6-second, 720 x 1280 videos from a text prompt and one or two input images.

Property	Value
License	Apache 2.0
Paper	arXiv:2410.15458
Parameters	VAE: 175M, DiT: 2.8B
Resolution	720 x 1280
Video Length	6 seconds @ 15 FPS
GPU Memory	9.3GB (BF16 with CPU offload)
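
As a rough sanity check on the table, the parameter counts can be turned into a weights-only memory estimate. The sketch below is an illustration, not the project's own measurement: the helper names are hypothetical, and it deliberately ignores any text encoder, activations, and framework overhead, so it is not expected to reproduce the 9.3GB figure, which is measured with CPU offload enabled.

```python
# Parameter counts taken from the property table above.
PARAMS = {"VideoVAE": 175e6, "VideoDiT": 2.8e9}

# Standard storage size, in bytes, of one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2}


def weight_gb(num_params: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes). Hypothetical helper."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def total_weight_gb(precision: str) -> float:
    """Sum the weights-only footprint of every component in PARAMS."""
    return sum(weight_gb(n, precision) for n in PARAMS.values())


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {total_weight_gb(precision):.2f} GB of weights")
```

Under these assumptions the VAE and DiT weights together come to about 5.95 GB in BF16 or FP16 and about 11.9 GB in FP32.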

What is Allegro-TI2V?

Allegro-TI2V is a text-image-to-video generation model: it takes a text prompt together with an input image and generates a video that starts from that image and follows the prompt. It produces 6-second clips at 15 FPS and 720 x 1280 resolution.

Implementation Details

The model architecture consists of two main components: a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. It supports three precision formats (FP32, BF16, FP16) and can reduce GPU memory usage by offloading to the CPU. The context length is 79.2K tokens, which corresponds to an 88-frame clip.

Supports both first-frame and first-and-last-frame video generation
Implements efficient memory management with CPU offloading options
Offers flexible precision options for different hardware configurations
Generates video at 720 x 1280 resolution
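
The 79.2K context length can be related to the 88-frame, 720 x 1280 clip with a little arithmetic. The sketch below is a plausible reconstruction rather than a statement of the model's internals: the VAE compression factors (4x in time, 8x in each spatial dimension) and the 2 x 2 spatial patch size are assumptions that this document does not state, and the function name is hypothetical.

```python
# Clip shape taken from this document.
FRAMES, HEIGHT, WIDTH = 88, 720, 1280

# ASSUMPTIONS (not stated in this document): VAE compression factors
# and the spatial patch size used when turning latents into tokens.
VAE_TEMPORAL = 4
VAE_SPATIAL = 8
PATCH = 2


def context_length(frames: int, height: int, width: int,
                   vae_t: int = VAE_TEMPORAL,
                   vae_s: int = VAE_SPATIAL,
                   patch: int = PATCH) -> int:
    """Number of tokens for one clip under the assumed factors."""
    latent_frames = frames // vae_t          # 88 -> 22
    tokens_h = height // (vae_s * patch)     # 720 -> 45
    tokens_w = width // (vae_s * patch)      # 1280 -> 80
    return latent_frames * tokens_h * tokens_w


if __name__ == "__main__":
    print(context_length(FRAMES, HEIGHT, WIDTH))  # 79200
```

With these assumed factors the count is 22 x 45 x 80 = 79,200 tokens, which matches the stated 79.2K.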

Core Capabilities

Generate videos from user prompts and first frame images
Create intermediate video content using both first and last frame inputs
Support for diverse content types including human subjects and dynamic scenes
Optional frame interpolation from 15 FPS to 30 FPS using EMA-VFI
Runs in 9.3GB of GPU memory in BF16 with CPU offload
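
To make the frame-rate figures concrete, the following sketch works out the clip duration and the frame count after 2x interpolation. It does not call EMA-VFI; it only assumes that interpolation inserts one new frame between each pair of neighbouring frames, which may differ from what the actual tool emits. Both function names are hypothetical.

```python
def duration_seconds(frames: int, fps: float) -> float:
    """Playback duration of a clip with the given frame count and rate."""
    return frames / fps


def interpolated_frame_count(frames: int, factor: int = 2) -> int:
    """Frame count after interpolation.

    ASSUMPTION: (factor - 1) new frames are inserted between each pair of
    neighbouring frames and nothing is added before the first or after
    the last frame.
    """
    return (frames - 1) * factor + 1


if __name__ == "__main__":
    native = 88  # frame count stated in this document
    print(f"native: {native} frames, {duration_seconds(native, 15):.2f} s at 15 FPS")
    smooth = interpolated_frame_count(native)
    print(f"interpolated: {smooth} frames, {duration_seconds(smooth, 30):.2f} s at 30 FPS")
```

Under this assumption, 88 frames at 15 FPS last about 5.87 seconds, and the interpolated clip has 175 frames lasting about 5.83 seconds at 30 FPS, so the duration stays close to the nominal 6 seconds while motion becomes smoother.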

Frequently Asked Questions

Q: What makes this model unique?

Allegro-TI2V generates 720 x 1280 video while keeping hardware requirements modest: with BF16 precision and CPU offload it needs 9.3GB of GPU memory. It is also open-source under the Apache 2.0 license, which permits both research and commercial use.

Q: What are the recommended use cases?

The model is designed for turning static images into video, which makes it useful for content creators, digital artists, and developers building video generation applications. It is particularly suited to cases where a still image should be animated in a direction specified by a text prompt, or where the motion between a given first and last frame needs to be filled in.