Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising



1 Shanghai AI Laboratory     2 MMLab, CUHK     3 SenseTime Research     4 Nanjing University     5 Tsinghua University

TL;DR: A universal methodology that extends short video diffusion models for efficient multi-text conditioned long video generation and editing.

Iron Man is lying. Super long, 400+ frames.

Video with smooth semantic changes.

Dog in the sun. Layout control.

Animation girl surfing.

A car running on the beach. Tuning-free.

Van Gogh Style Painting.

A girl is walking on the moon. Imagination.

Abstract

Leveraging large-scale image-text datasets and advances in diffusion models, text-driven generative models have made remarkable strides in image generation and editing. This study explores the potential of extending this text-driven strategy to the generation and editing of videos. Current methodologies, while innovative, are often confined to extremely short videos (typically fewer than 24 frames) and are limited to a single text condition. These constraints significantly limit their applicability, given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this, we introduce a novel paradigm, dubbed Gen-L-Video, capable of generating and editing videos comprising hundreds of frames with diverse semantic segments while preserving content consistency, even under the confines of limited GPU resources. We have implemented three mainstream text-driven video generation and editing methodologies and, with our proposed paradigm, extended them to accommodate longer videos comprising a variety of semantic segments. Our experimental results show that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications.

Mainstream Diffusion Methods for Video Generation and Editing

Pretrained Text-to-Video (t2v): This involves training the diffusion model on a large-scale text-video paired dataset such as WebVid-10M. Typically, a temporal interaction module, like Temporal Attention, is added to the denoising model, fostering inter-frame information interaction to ensure frame consistency.
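
As a rough illustration, the sketch below shows one common way such a temporal attention block can be wired in PyTorch; the tensor layout and the use of nn.MultiheadAttention are our own assumptions rather than the exact design of any particular pretrained text-to-video model.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal self-attention: every spatial location attends
    across frames, which is what enforces inter-frame consistency."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Treat each spatial position as an independent sequence over frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```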

Tuning-free Text-to-Video (tuning-free t2v): This utilizes a pretrained Text-to-Image model to generate and edit videos frame by frame, while applying additional controls to maintain consistency across frames (for instance, copying and modifying attention maps, sparse causal attention, etc.).
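
Sparse causal attention can be sketched as follows; this is a simplified single-head version with assumed tensor shapes, meant only to show how each frame restricts its keys and values to the first and the previous frame.

```python
import torch

def sparse_causal_attention(q, k, v):
    """Each frame attends only to the first frame and its previous frame,
    anchoring appearance to the start of the clip at low cost."""
    # q, k, v: (frames, tokens, dim) — flattened spatial tokens per frame.
    frames, tokens, dim = q.shape
    outputs = []
    for i in range(frames):
        prev = max(i - 1, 0)
        k_i = torch.cat([k[0], k[prev]], dim=0)   # keys from frame 0 and frame i-1
        v_i = torch.cat([v[0], v[prev]], dim=0)
        attn = torch.softmax(q[i] @ k_i.T / dim ** 0.5, dim=-1)
        outputs.append(attn @ v_i)
    return torch.stack(outputs)                   # (frames, tokens, dim)
```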

One-shot Tuning Text-to-Video (one-shot tuning t2v): By fine-tuning a pretrained text-to-image generation model on a single video instance, it is possible to generate videos with similar motions but different contents. Despite the extra training cost, one-shot tuning-based methods often offer more editing flexibility than tuning-free methods.
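
The sketch below outlines such a one-shot tuning loop under our own simplifying assumptions: unet, scheduler, encode_video, and encode_text are hypothetical stand-ins for an inflated text-to-image pipeline, and only the newly added temporal/attention parameters are updated.

```python
import torch
import torch.nn.functional as F

def one_shot_tune(unet, scheduler, encode_video, encode_text,
                  video, prompt, steps=500, lr=3e-5):
    latents = encode_video(video)          # (1, frames, c, h, w) latent video
    text_emb = encode_text(prompt)
    # Fine-tune only the temporal / attention layers added during inflation.
    params = [p for n, p in unet.named_parameters()
              if "temporal" in n or "attn" in n]
    optim = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.num_train_timesteps, (1,),
                          device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        loss = F.mse_loss(unet(noisy, t, text_emb), noise)  # noise prediction
        loss.backward()
        optim.step()
        optim.zero_grad()
    return unet
```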

Gen-L-Video via Temporal Path Co-denoising

Unlike previous methods, we do not construct or train a long-video generator directly. Instead, we view a long video as a collection of short video clips, each carrying its own semantic information, so generating a long video can be seen as generating and concatenating multiple short videos. However, rather than using a simplistic division, we treat a long video as a set of short clips that overlap in time. We demonstrate that, under certain conditions, the denoising path of a long video can be approximated by jointly denoising these temporally overlapping short videos. In particular, as depicted in the figure above, the noisy long video is first mapped into multiple noisy short clips via a designated function. Existing off-the-shelf short video diffusion models are then employed to denoise these clips under the guidance of different text conditions. The denoised short clips are subsequently merged and mapped back to a less noisy version of the long video. In essence, this procedure establishes an abstract long-video generator and editor without requiring any additional training, enabling the generation and editing of videos of arbitrary length using established short video generation and editing methodologies.
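
For concreteness, here is a minimal sketch of one co-denoising step under our own simplifying assumptions: denoise_clip is a hypothetical wrapper around any off-the-shelf short video diffusion model that denoises a clip of clip_len frames given a text prompt, clip_prompts holds one prompt per clip, and overlapping predictions are merged by simple per-frame averaging (the merging described above can be more general).

```python
import torch

def co_denoise_step(latents, clip_prompts, denoise_clip, timestep,
                    clip_len=16, stride=8):
    """One denoising step of a long video via overlapping short clips.

    Assumes (frames - clip_len) is divisible by stride so that every frame
    is covered by at least one clip.
    """
    # latents: (frames, c, h, w) — the noisy long video at this timestep.
    frames = latents.shape[0]
    merged = torch.zeros_like(latents)
    counts = torch.zeros(frames, 1, 1, 1, device=latents.device)
    starts = range(0, frames - clip_len + 1, stride)
    for idx, start in enumerate(starts):
        clip = latents[start:start + clip_len]
        denoised = denoise_clip(clip, clip_prompts[idx], timestep)
        merged[start:start + clip_len] += denoised
        counts[start:start + clip_len] += 1
    # Frames covered by several clips are averaged, so neighboring clips
    # share a common denoising trajectory and stay mutually consistent.
    return merged / counts.clamp(min=1)
```

Repeating this step for every timestep of the diffusion sampler, with a different prompt per clip, is what allows videos of hundreds of frames with multiple text conditions to be produced under limited GPU resources.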

Multi-Text Conditioned Long Video

Source Video of Demon Slayer

Boy ➜ Girl

Boy ➜ Cute Girl, Shirt

Long Videos with Smooth Semantic Changes

A man is boating, village ➜ A man is walking by, city, sunset

A jeep car is running on the snow, sunny ➜ A jeep car is running on the snow, night

A jeep car is running on the beach, sunny ➜ A jeep car is running on the beach, night

Lion, grass, rainy ➜ Cat, grass, Sun

Iron Man is skiing in the snow ➜ Iron Man is flying in the sky

A man is surfing in the sea ➜ A man is skiing in the snow

Editing Anything in Video

Source Video of Man Surfing

Mask of Man

Man ➜ Iron Man

SAM Segmentation Map

Control Map

Man ➜ Batman

Source Video of Eating Pizza

Mask of Pizza

Pizza ➜ Cake

Source Video of Girl in Water

Mask of Sunglasses

Sunglasses ➜ Pink Sunglasses

Controllable Video

Pose Control Video

A man is playing tennis ➜ A woman is playing tennis

A man is playing tennis ➜ A cute boy is playing tennis

A man is playing tennis ➜ A man is running in the fire

A man is playing tennis ➜ Iron Man is fighting in the snow

A man is playing ➜ A Van Gogh style painting of a man dancing

Depth Control Video

Cat in the sun ➜ Realistic tiger in the sun

Cat in the sun ➜ Cute girl in the sun

Gen-L-Video VS 🔥 Isolated Short Video Generation

Astronaut riding a horse.

Isolated Generation. Bad.

Astronaut riding a horse.

Gen-L-Video. Good.

Astronaut riding a horse.

Gen-L-Video. Good.

Astronaut riding a horse, Painting.

Isolated Generation. Bad.

Astronaut riding a horse, Painting.

Gen-L-Video. Good.

Astronaut riding a horse, Painting.

Gen-L-Video. Good.

More Results

Hiking

Tower

Road Driving

Cow Walking

Raining

Skiing/Surfing

Ablations with Tuning-Free Text-to-Video

Gen-L-Video with bi-directional cross-frame attention

Win!

Road ➜ Snow

Road ➜ Beach

Jeep car ➜ Red car

Gen-L-Video with sparse causal attention

Road ➜ Snow

Road ➜ Beach

Jeep car ➜ Red car

Gen-L-Video Isolated

Worst!

Road ➜ Snow

Road ➜ Beach

Jeep car ➜ Red car

BibTeX

@misc{wang2023genlvideo,
  title={Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising},
  author={Fu-Yun Wang and Wenshuo Chen and Guanglu Song and Han-Jia Ye and Yu Liu and Hongsheng Li},
  year={2023},
  eprint={2305.18264},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Explanation

1. For better visualization, we only show the edited words on this page. Please check our paper and code for the full source prompts.
2. Most of the results are directly edited with Stable Diffusion v1.4, while some are based on Anything-v4.
3. Our method, Gen-L-Video, is a unifying paradigm that extends short video diffusion models for powerful multi-text conditioned long video generation and editing.