In the paradigm of AI-generated content (AIGC), increasing attention has been paid to extending pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks struggle to maintain consistent narratives and to handle rapid shifts in scene composition or object placement when given only a single user prompt. This paper introduces a new framework, dubbed DirecT2V, which leverages instruction-tuned large language models (LLMs) to generate frame-by-frame descriptions from a single abstract user prompt. DirecT2V uses LLM directors to divide the user input into a separate prompt for each frame, which allows time-varying content to be included and facilitates consistent video generation. To maintain temporal consistency and prevent object collapse, we propose a novel value mapping method and dual-softmax filtering. Extensive experimental results validate the effectiveness of the DirecT2V framework in producing visually coherent and consistent videos from abstract user prompts, addressing the challenges of zero-shot video generation.
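The abstract names dual-softmax filtering without defining it. As a rough illustration only, the sketch below shows the generic dual-softmax operation as it is commonly used in feature matching: a similarity matrix is normalized with a softmax along each row and along each column, the two results are multiplied elementwise, and entries below a threshold are discarded. The function names, the `threshold` value, and the choice of a plain similarity matrix as input are assumptions made for this example; how DirecT2V actually applies the operation is not stated here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(sim):
    """Elementwise product of the row-wise and column-wise softmax of `sim`.

    `sim` is a 2-D similarity matrix given as a list of equal-length lists.
    An entry scores high only if it dominates both its row and its column.
    """
    rows = [softmax(row) for row in sim]
    # cols_t[j][i] holds the column-wise softmax value for entry (i, j).
    cols_t = [softmax(list(col)) for col in zip(*sim)]
    n_rows, n_cols = len(sim), len(sim[0])
    return [[rows[i][j] * cols_t[j][i] for j in range(n_cols)]
            for i in range(n_rows)]


def filter_matches(sim, threshold=0.5):
    """Boolean mask keeping entries whose dual-softmax score reaches `threshold`.

    The threshold of 0.5 is an illustrative assumption, not a value from the paper.
    """
    conf = dual_softmax(sim)
    return [[c >= threshold for c in row] for row in conf]
```

With a strongly diagonal similarity matrix such as `[[10.0, 0.0], [0.0, 10.0]]`, only the diagonal entries survive the filter, whereas a uniform matrix yields a score of 0.25 everywhere and nothing passes the assumed threshold. This mutual-agreement behavior is what makes the operation useful for suppressing ambiguous correspondences.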