We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires resolving a fundamental trade-off: generalizing across diverse, noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from video in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general: it generates the motion of mesh vertices directly, bypassing parametric models. DuoMo achieves state-of-the-art performance: on EMDB, it reduces world-space reconstruction error by 16% while maintaining low foot skating, and on RICH, it reduces world-space error by 30%. Project page: https://yufu-wang.github.io/duomo/
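To make the two-stage factorization concrete, the following is a minimal sketch of the inference flow the abstract describes: a camera-space diffusion model conditioned on video features, whose output then conditions a world-space diffusion model. All names, shapes, noise schedules, and the stand-in linear denoisers are hypothetical illustrations, not DuoMo's actual architecture or API.

```python
# Minimal sketch of a two-stage diffusion pipeline (hypothetical, not DuoMo's code).
import numpy as np

T_FRAMES, N_VERTS = 60, 64        # hypothetical sequence length and vertex count
DIM = N_VERTS * 3                 # per-frame motion as raw mesh-vertex coordinates
STEPS = 50                        # number of diffusion steps
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = np.cumprod(1.0 - betas)  # cumulative signal-retention schedule

def denoise(x_t, t, cond, weights):
    """Stand-in for a learned denoiser: predicts clean motion x0 from the
    noisy sample x_t at step t, given conditioning. A real model would be
    a neural network; a fixed linear map keeps this example runnable."""
    return np.tanh(x_t @ weights) + 0.1 * cond

def sample(cond, weights, rng):
    """Simplified ancestral sampling loop: predict x0, then re-noise."""
    x = rng.standard_normal((T_FRAMES, DIM))
    for t in reversed(range(STEPS)):
        x0 = denoise(x, t, cond, weights)
        if t > 0:
            noise = rng.standard_normal(x.shape)
            x = np.sqrt(alphas[t - 1]) * x0 + np.sqrt(1 - alphas[t - 1]) * noise
        else:
            x = x0
    return x

rng = np.random.default_rng(0)
cam_weights = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
world_weights = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)

video_features = rng.standard_normal((T_FRAMES, DIM))     # stand-in per-frame video features
camera_motion = sample(video_features, cam_weights, rng)  # stage 1: motion in camera coordinates
world_motion = sample(camera_motion, world_weights, rng)  # stage 2: lift and refine in world coordinates
print(world_motion.shape)                                 # (60, 192)
```

The key design point the sketch mirrors is the conditioning chain: the second model never sees the raw video, only the camera-space estimate, which is what lets each stage specialize (generalization from noisy video vs. global consistency).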