This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this limitation, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly distributed surrounding cameras on diverse 3D Unreal Engine (UE) platforms. Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster
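To make the gated self-attention idea concrete, below is a minimal, illustrative sketch of how an injector layer can fuse entity/trajectory tokens with video latent tokens while starting as an identity mapping (so the pretrained video diffusion prior is preserved at initialization). All names, shapes, and the single-head, projection-free simplification here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(hidden, entity_tokens, gate_alpha=0.0):
    """Illustrative gated self-attention injector (hypothetical simplification).

    hidden:        (N, d) video latent tokens
    entity_tokens: (M, d) fused entity tokens (e.g. entity text embedding
                   combined with a 6DoF pose embedding per frame)
    gate_alpha:    learnable scalar; since tanh(0) = 0, the layer is an
                   identity at initialization, leaving the pretrained
                   diffusion backbone's behavior untouched.
    """
    d = hidden.shape[-1]
    # Attend jointly over video tokens and entity/trajectory tokens.
    x = np.concatenate([hidden, entity_tokens], axis=0)
    attn = softmax(x @ x.T / np.sqrt(d))  # single head, no Q/K/V projections for brevity
    out = attn @ x
    # Gated residual update applied only to the video tokens.
    return hidden + np.tanh(gate_alpha) * out[: hidden.shape[0]]

# With gate_alpha = 0 the injector is a no-op on the video tokens.
h = np.random.randn(4, 8)
e = np.random.randn(2, 8)
assert np.allclose(gated_self_attention(h, e, gate_alpha=0.0), h)
```

The zero-initialized gate is a common trick for plug-and-play conditioning layers: the new control branch contributes nothing until training opens the gate, which is one way to realize the "preserve the video diffusion prior" property the abstract emphasizes.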