Human motion prediction is crucial for human-centric multimedia understanding and interaction. Current methods typically rely on ground-truth human poses as observed input, which is impractical in real-world scenarios where only raw visual sensor data is available. To deploy these methods in practice, a preliminary pose-estimation stage is required. However, such two-stage approaches often suffer performance degradation due to error accumulation. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes information density, resulting in the loss of fine-grained features. In this paper, we propose \textit{LiDAR-HMP}, the first single-LiDAR-based 3D human motion prediction approach, which takes the raw LiDAR point cloud as input and forecasts future 3D human poses directly. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motion to further refine the prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments.