Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and imprecise dynamics); when imposed as strong constraints, these reconstructions override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating an implicit, view-agnostic motion representation that aligns naturally with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens that are injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), enforcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, allowing the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments show that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
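The annealed auxiliary supervision described above can be illustrated with a minimal sketch. The abstract states only that the SMPL-based geometric term is used early and annealed to zero; the linear schedule, the names `aux_weight` and `total_loss`, and the parameters `anneal_steps` and `w0` are assumptions made for illustration, not details taken from the method.

```python
def aux_weight(step, anneal_steps, w0=1.0):
    """Weight of the auxiliary geometric loss at a given training step.

    Assumption: a linear decay from w0 to 0 over `anneal_steps` steps,
    after which the weight stays at 0. The actual schedule is not
    specified in the abstract.
    """
    if anneal_steps <= 0:
        return 0.0
    # Fraction of the annealing window already elapsed, clamped to [0, 1].
    frac = min(max(step, 0) / anneal_steps, 1.0)
    return w0 * (1.0 - frac)


def total_loss(gen_loss, geo_loss, step, anneal_steps, w0=1.0):
    """Combine the generator loss with the annealed geometric term.

    `gen_loss` and `geo_loss` are hypothetical scalar loss values standing
    in for the video-generation objective and the SMPL-based supervision.
    """
    return gen_loss + aux_weight(step, anneal_steps, w0) * geo_loss


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 2000):
        print(step, aux_weight(step, anneal_steps=1000))
```

Under this sketch, the geometric term contributes fully at the start of training and vanishes once `step` reaches `anneal_steps`, so later updates depend only on the generation objective.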