Text-to-video (T2V) synthesis has gained increasing attention in the community, where the recently emerged diffusion models (DMs) have demonstrated notably stronger performance than earlier approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they can suffer from key limitations (e.g., action-occurrence disorder, crude video motion) in modeling the intricate temporal dynamics, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which (step 1) extracts the key actions from the input text with a proper time-order arrangement, (step 2) transforms the action schedule into a dynamic scene graph (DSG) representation, and (step 3) enriches the scenes in the DSG with sufficient and reasonable details. By leveraging powerful existing LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action-scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets show that our Dysen-VDM consistently outperforms prior art by significant margins, especially in scenarios with complex actions. Code is available at https://haofei.vip/Dysen-VDM
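The three Dysen steps above can be sketched in data-structure terms. The following is a minimal, illustrative Python sketch, not the paper's implementation: the action extraction and scene enrichment are hard-coded stubs standing in for the LLM (e.g., ChatGPT) calls, and all class and function names (`Action`, `Triplet`, `to_dsg`, `enrich`) are assumptions for illustration. A DSG is represented here as a list of per-time-step scene graphs, each a set of (subject, relation, object) triplets.

```python
from dataclasses import dataclass

@dataclass
class Action:
    # A key action extracted from the prompt, with its assigned time step (step 1).
    verb: str
    subject: str
    obj: str
    step: int

@dataclass(frozen=True)  # frozen so triplets are hashable and can live in sets
class Triplet:
    subject: str
    relation: str
    obj: str

def extract_actions(prompt: str) -> list[Action]:
    # Step 1 (stub): in Dysen an LLM extracts and time-orders the actions
    # via in-context learning; hard-coded here purely for illustration.
    return [
        Action("pick up", "man", "cup", step=0),
        Action("drink from", "man", "cup", step=1),
    ]

def to_dsg(actions: list[Action]) -> list[set[Triplet]]:
    # Step 2: build one scene graph (a set of triplets) per time step.
    n_steps = max(a.step for a in actions) + 1
    dsg: list[set[Triplet]] = [set() for _ in range(n_steps)]
    for a in actions:
        dsg[a.step].add(Triplet(a.subject, a.verb, a.obj))
    return dsg

def enrich(dsg: list[set[Triplet]],
           details: list[set[Triplet]]) -> list[set[Triplet]]:
    # Step 3 (stub): the LLM would propose plausible extra scene details
    # (attributes, background objects) for each time step's graph.
    for frame_graph, extra in zip(dsg, details):
        frame_graph |= extra
    return dsg

# Toy end-to-end run of the three steps on a single prompt.
dsg = to_dsg(extract_actions("a man picks up a cup and drinks from it"))
dsg = enrich(dsg, [{Triplet("cup", "on", "table")}, set()])
```

The per-step graphs produced this way would then be encoded into spatio-temporal features and fed to the T2V diffusion backbone; that encoding stage is beyond the scope of this sketch.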