Recent text-to-video diffusion models have achieved impressive progress. In practice, users often want to control object motion and camera movement independently for customized video creation. However, current methods do not focus on controlling object motion and camera movement separately in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for one or multiple objects and/or camera movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model's inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page: https://direct-a-video.github.io/.
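To make the idea of augmentation-based, annotation-free training more concrete, the sketch below shows one way a camera pan could be simulated from a video by sliding a crop window across its frames. This is an illustration under stated assumptions and not the method's actual implementation: the function names `pan_crop_origins` and `simulate_pan`, the parameters `cx` and `cy` (total window shift as a fraction of the crop size), and the linear motion schedule are all hypothetical. The point it illustrates is that the parameters used to generate the augmentation can double as the conditioning label, so no explicit motion annotation is required.

```python
from typing import List, Tuple

Frame = List[List[int]]  # one grayscale frame: rows of pixel values


def pan_crop_origins(num_frames: int, frame_w: int, frame_h: int,
                     crop_w: int, crop_h: int,
                     cx: float, cy: float) -> List[Tuple[int, int]]:
    """Return the (left, top) crop origin for each frame of a simulated pan.

    cx and cy are hypothetical pan parameters (an assumption of this sketch):
    the total window shift over the clip, as a fraction of the crop width
    and crop height respectively.
    """
    x0 = (frame_w - crop_w) / 2  # window position when centered
    y0 = (frame_h - crop_h) / 2
    # Reject pans that would push the window outside the source frame
    # (this also rejects crops larger than the frame, where x0 or y0 < 0).
    if abs(cx) * crop_w / 2 > x0 or abs(cy) * crop_h / 2 > y0:
        raise ValueError("pan would move the crop window outside the frame")
    origins = []
    for t in range(num_frames):
        # alpha runs linearly from -0.5 (first frame) to +0.5 (last frame)
        alpha = t / (num_frames - 1) - 0.5 if num_frames > 1 else 0.0
        origins.append((round(x0 + alpha * cx * crop_w),
                        round(y0 + alpha * cy * crop_h)))
    return origins


def simulate_pan(frames: List[Frame], crop_w: int, crop_h: int,
                 cx: float, cy: float) -> List[Frame]:
    """Crop every frame with a sliding window to mimic a camera pan."""
    frame_h, frame_w = len(frames[0]), len(frames[0][0])
    origins = pan_crop_origins(len(frames), frame_w, frame_h,
                               crop_w, crop_h, cx, cy)
    return [[row[left:left + crop_w] for row in frame[top:top + crop_h]]
            for frame, (left, top) in zip(frames, origins)]


if __name__ == "__main__":
    # Five identical 16x8 frames whose pixel value equals the column index.
    frames = [[list(range(16)) for _ in range(8)] for _ in range(5)]
    clip = simulate_pan(frames, crop_w=8, crop_h=8, cx=0.5, cy=0.0)
    # Leftmost visible source column in each output frame.
    print([f[0][0] for f in clip])  # [2, 3, 4, 5, 6]
```

In this sketch, a training pair would consist of the cropped clip together with the `(cx, cy)` values that produced it, so the supervision signal comes from the augmentation itself. A zoom could be simulated analogously by varying the crop size over time and resizing each crop to a fixed resolution.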