Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics); when imposed as strong constraints, these inaccuracies override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens that are injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), enforcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
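The idea of auxiliary geometric supervision that is annealed to zero can be illustrated with a minimal sketch. The abstract does not state the schedule, so the linear decay, the function names `aux_weight` and `total_loss`, and the parameters `anneal_steps` and `w0` are all assumptions made for illustration, not the authors' implementation.

```python
def aux_weight(step: int, anneal_steps: int, w0: float = 1.0) -> float:
    """Weight of the auxiliary geometric (SMPL-based) loss at a training step.

    Assumption: a linear decay from w0 at step 0 to exactly 0 at
    anneal_steps; the source only says the term is "annealed to zero".
    """
    if anneal_steps <= 0:
        return 0.0
    # Fraction of the annealing window already consumed, clamped to [0, 1].
    frac = min(max(step, 0) / anneal_steps, 1.0)
    return w0 * (1.0 - frac)


def total_loss(gen_loss: float, geo_loss: float, step: int, anneal_steps: int) -> float:
    """Combine the generator loss with the annealed auxiliary term.

    gen_loss and geo_loss are hypothetical scalar stand-ins for the video
    generation objective and the geometric supervision objective.
    """
    return gen_loss + aux_weight(step, anneal_steps) * geo_loss
```

Under this sketch, the geometric term shapes only the early phase of training; once `step` reaches `anneal_steps`, the objective reduces to the generator loss alone, so the motion tokens are from then on shaped only by the data and the generator's priors.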