Recent advances in diffusion-based generative modeling have led to text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. However, most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., `a red panda climbing a tree'). Generating multi-scene videos is pertinent since they are ubiquitous in the real world (e.g., `a red panda climbing a tree' followed by `the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., `a red panda climbing a tree') and the second scene description (e.g., `the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and remain visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points on the overall score, which averages visual consistency and text adherence under human evaluation. The project website is https://talc-mst2v.github.io/.
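The core idea of time-aligned conditioning can be sketched as routing each scene's caption embedding to only the frames of that scene, rather than conditioning all frames on a single concatenated prompt. The following is a minimal, hypothetical sketch (function and variable names are our own, not from the released TALC code), assuming per-frame visual features and per-scene caption embeddings are already computed:

```python
import torch

def time_aligned_conditioning(frame_feats, scene_caption_embs, frames_per_scene):
    """Hypothetical sketch of time-aligned caption routing.

    frame_feats:        (T, D_v) tensor of per-frame visual features.
    scene_caption_embs: list of (L_i, D_t) tensors, one caption embedding
                        sequence per scene (e.g., from a frozen text encoder).
    frames_per_scene:   list of frame counts per scene; must sum to T.

    Returns a list of length T giving, for each frame, the text context
    it should attend to in cross-attention.
    """
    assert sum(frames_per_scene) == frame_feats.shape[0]
    per_frame_context = []
    for caption_emb, n_frames in zip(scene_caption_embs, frames_per_scene):
        # Every frame in scene i attends only to caption i's token embeddings,
        # so earlier frames see the first description and later frames the second.
        per_frame_context.extend([caption_emb] * n_frames)
    return per_frame_context

# Example: a 4-frame video with two 2-frame scenes and two captions.
frames = torch.randn(4, 8)
captions = [torch.randn(3, 16), torch.randn(5, 16)]  # variable-length captions
context = time_aligned_conditioning(frames, captions, [2, 2])
```

In practice the per-frame context would feed the key/value inputs of the cross-attention layers in the diffusion backbone; this sketch only illustrates the alignment bookkeeping.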