Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Human motion generation is a long-standing problem, and scene-aware motion synthesis has recently been widely studied because of its numerous applications. Prevailing methods rely heavily on paired motion-scene data, which are limited in quantity, and models trained on only a few specific scenes generalize poorly to diverse ones. We therefore propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis that no longer requires paired motion-scene data. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so motion naturalness and interaction plausibility are maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN Inversion manner to maintain motion continuity, and controls keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture and on real scenes from PROX and Replica. Results show that our framework achieves better motion naturalness and interaction plausibility than cutting-edge methods, which also indicates the feasibility of using DIP for motion synthesis in more general tasks and versatile scenes. https://jingyugong.github.io/DiffusionImplicitPolicy/
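The abstract does not give a formula for the motion blending step, so the following is only a minimal sketch of one plausible reading. It assumes that fusing in "rotation power space" means taking a fractional power of the relative rotation, R = R0 (R0^T R1)^w, and that fusing in "translation linear space" means ordinary linear interpolation of the translations. All function names (`rot_log`, `rot_exp`, `rot_power`, `blend_pose`, `rot_z`) and the blend weight `w` are hypothetical and are not taken from the DIP code.

```python
import numpy as np


def rot_log(R):
    """Log map: 3x3 rotation matrix -> axis-angle vector.

    Simplified for illustration: not numerically robust when the
    rotation angle is close to pi (sin(theta) approaches zero).
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([
        R[2, 1] - R[1, 2],
        R[0, 2] - R[2, 0],
        R[1, 0] - R[0, 1],
    ]) / (2.0 * np.sin(theta))
    return theta * axis


def rot_exp(v):
    """Exp map (Rodrigues formula): axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(v)
    if theta < 1e-8:
        return np.eye(3)
    k = v / theta
    K = np.array([
        [0.0, -k[2], k[1]],
        [k[2], 0.0, -k[0]],
        [-k[1], k[0], 0.0],
    ])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def rot_power(R, w):
    """Fractional power R^w, computed as exp(w * log(R))."""
    return rot_exp(w * rot_log(R))


def blend_pose(R0, t0, R1, t1, w):
    """Blend two rigid poses with weight w in [0, 1].

    Assumed scheme: rotation via a power of the relative rotation,
    translation via linear interpolation.
    """
    R = R0 @ rot_power(R0.T @ R1, w)
    t = (1.0 - w) * t0 + w * t1
    return R, t


def rot_z(angle):
    """Helper: rotation about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Usage: blend halfway between the end of one sub-task and the start of the next.
R_mid, t_mid = blend_pose(
    np.eye(3), np.zeros(3),
    rot_z(np.pi / 2), np.array([2.0, 0.0, 0.0]),
    w=0.5,
)
```

Under these assumptions, the halfway blend of an identity rotation and a 90-degree turn about z is a 45-degree turn about z, and the translation lands at the midpoint. Interpolating the rotation through a power of the relative rotation keeps every intermediate result a valid rotation, which naive averaging of rotation matrices does not guarantee.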
