Most text-to-video (T2V) generative models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, generating multi-scene videos is important since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model on multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline with a relative gain of 29% in the overall score, which averages visual consistency and text adherence under human evaluation.
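The core idea of time-aligned conditioning, assigning each generated scene the embedding of its own caption rather than a single pooled prompt, can be sketched as follows. This is a minimal illustration under assumed names and shapes (the function `time_aligned_captions`, per-frame embedding broadcast, and the stand-in embeddings are all illustrative, not the paper's actual implementation):

```python
import numpy as np

def time_aligned_captions(scene_embeddings, frames_per_scene):
    """Broadcast each scene's caption embedding to that scene's frames.

    scene_embeddings: (num_scenes, dim) array, one text embedding per scene.
    frames_per_scene: list of frame counts, one entry per scene.
    Returns a (total_frames, dim) conditioning array in which frame t is
    conditioned only on the caption of the scene it belongs to.
    """
    return np.repeat(scene_embeddings, frames_per_scene, axis=0)

# Two scenes: "a red panda climbing a tree" (8 frames) followed by
# "the red panda sleeps on the top of the tree" (8 frames).
# Constant vectors stand in for real text-encoder embeddings.
emb = np.stack([np.full(4, 0.0), np.full(4, 1.0)])
cond = time_aligned_captions(emb, [8, 8])  # shape (16, 4)
```

In a diffusion T2V model, such a per-frame conditioning array would feed the cross-attention layers so that early frames attend to the first description and later frames to the second, which is the temporal alignment the framework exploits.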