Comprehending human motion is a fundamental challenge in developing Human-Robot Collaborative applications. Computer vision researchers have addressed this problem by focusing only on reducing prediction error, without taking into account the requirements that facilitate deployment on robots. In this paper, we propose a new Transformer-based model that simultaneously handles real-time 3D human motion forecasting in the short and long term. Our 2-Channel Transformer (2CH-TR) efficiently exploits the spatio-temporal information of a briefly observed sequence (400 ms) and achieves accuracy competitive with the current state of the art. 2CH-TR stands out for its efficiency, being lighter and faster than its competitors. In addition, we test our model in conditions where the human motion is severely occluded, demonstrating its robustness in reconstructing and predicting 3D human motion in a highly noisy environment. Our experimental results show that the proposed 2CH-TR outperforms ST-Transformer, another state-of-the-art Transformer-based model, in both reconstruction and prediction given the same input prefix. On the Human3.6M dataset with a 400 ms input prefix, our model reduces the mean squared error of ST-Transformer by 8.89% in short-term prediction and by 2.57% in long-term prediction.
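To make the reported percentages concrete, the following minimal sketch shows how a relative reduction in mean squared error can be computed from two sets of predictions. It assumes the common definition (baseline error minus model error, divided by baseline error); the function names and the toy coordinate values are hypothetical and are not taken from our experiments.

```python
def mean_squared_error(predicted, target):
    """MSE between two equal-length flat sequences of joint coordinates."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def relative_reduction_percent(baseline_mse, model_mse):
    """Percentage by which model_mse is lower than baseline_mse."""
    if baseline_mse <= 0:
        raise ValueError("baseline_mse must be positive")
    return 100.0 * (baseline_mse - model_mse) / baseline_mse


if __name__ == "__main__":
    # Hypothetical toy values, for illustration only (not experimental data).
    target = [0.0, 1.0, 2.0, 3.0]
    baseline_pred = [0.5, 1.5, 2.5, 3.5]  # stands in for a baseline model
    model_pred = [0.4, 1.4, 2.4, 3.4]     # stands in for the proposed model

    baseline_mse = mean_squared_error(baseline_pred, target)
    model_mse = mean_squared_error(model_pred, target)
    print(f"relative MSE reduction: "
          f"{relative_reduction_percent(baseline_mse, model_mse):.2f}%")
```

Under this definition, a reduction of 8.89% means that the model's mean squared error is roughly 0.911 times that of the baseline on the same test sequences.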