We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.
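The decomposition into a global-trajectory estimate and a local-motion estimate, each conditioned on the other and refined over several rounds, can be illustrated with a small toy sketch. This is not RoHM's implementation: the learned diffusion denoisers are replaced here by a mask-aware moving average, the motion is reduced to one root coordinate and one joint coordinate per frame, and every name (`masked_smooth`, `reconstruct`, `root_obs`, `joint_obs`, `rounds`) is hypothetical. Occluded frames are marked with `None`.

```python
from typing import List, Optional, Tuple

Seq = List[Optional[float]]  # None marks an occluded frame


def masked_smooth(values: Seq, fallback: List[float], radius: int = 2) -> List[float]:
    """Toy stand-in for a learned denoiser (assumption, not RoHM's model):
    average the visible samples in a temporal window around each frame,
    and use `fallback` where the window contains no visible sample."""
    out = []
    for t in range(len(values)):
        lo, hi = max(0, t - radius), min(len(values), t + radius + 1)
        window = [v for v in values[lo:hi] if v is not None]
        out.append(sum(window) / len(window) if window else fallback[t])
    return out


def reconstruct(root_obs: Seq, joint_obs: Seq, rounds: int = 3) -> Tuple[List[float], List[float]]:
    """Alternate between a global estimate (root trajectory) and a local
    estimate (joint offset relative to the root), each using the other's
    latest output as its condition."""
    n = len(root_obs)
    root = masked_smooth(root_obs, [0.0] * n)  # initial trajectory guess
    local = [0.0] * n
    for _ in range(rounds):
        # Local step, conditioned on the current trajectory:
        # express each visible joint observation relative to the root.
        local_obs = [None if j is None else j - r for j, r in zip(joint_obs, root)]
        local = masked_smooth(local_obs, local)

        # Trajectory step, conditioned on the current local motion:
        # a visible joint minus its offset is a second cue for the root.
        root_in: Seq = []
        for r_o, j_o, off in zip(root_obs, joint_obs, local):
            cues = [c for c in (r_o, None if j_o is None else j_o - off) if c is not None]
            root_in.append(sum(cues) / len(cues) if cues else None)
        root = masked_smooth(root_in, root)
    return root, local
```

Calling `reconstruct(root_obs, joint_obs)` on two equal-length sequences returns a complete root trajectory and a complete local offset sequence with no `None` entries, provided each occluded frame has a visible sample within the smoothing window or a usable fallback from the previous round. The point of the sketch is only the control flow: two separate estimators, mutual conditioning, and repeated refinement.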