Robust and accurate perception of humans in their 3D scene context is essential for integrating robots into everyday environments. Existing approaches, however, often fail to produce human motion estimates that are plausible, accurate, and consistent with the surrounding scene, especially under heavy occlusion or partial visibility. This can limit both the safety and the efficiency of robotic operation. We introduce HumanFlow, a latent diffusion model that unifies human motion tracking and forecasting, conditioned on the 3D scene context. We show that our human motion model produces smooth and accurate predictions under challenging conditions, including heavy occlusions, and outperforms state-of-the-art methods in tracking accuracy while being significantly more efficient. Furthermore, we show how HumanFlow's latent space can be tightly coupled with control by conditioning a flow-matching-based, approximate model predictive control (MPC) policy on these representations. We validate our policy in simulation with real human trajectories for micro aerial vehicle (MAV) social navigation, where it achieves superior navigation performance and remains collision-free, even under partial observability of the human.