Integrating robots into populated environments is a complex challenge that requires an understanding of human social dynamics. In this work, we propose to model social motion forecasting in a shared human-robot representation space, which enables us to synthesize robot motions that interact with humans in social scenarios even though no robot is observed during training. We develop a transformer-based architecture called ECHO, which operates in this shared space to predict the future motions of the agents encountered in social scenarios. In contrast to prior work, we reformulate social motion forecasting as the refinement of individually predicted motions conditioned on the surrounding agents, which simplifies training and still allows single-person motion forecasting when only one human is in the scene. We evaluate our model on multi-person and human-robot motion forecasting tasks and obtain state-of-the-art performance by a large margin while remaining efficient and running in real time. Additionally, our qualitative results showcase the effectiveness of our approach in generating human-robot interaction behaviors that can be controlled via text commands.