Integrating robots into populated environments is a complex challenge that requires an understanding of human social dynamics. In this work, we propose to model social motion forecasting in a shared human-robot representation space, which enables us to synthesize robot motions that interact with humans in social scenarios despite not observing any robot motion during training. We develop a transformer-based architecture called ECHO, which operates in this shared space to predict the future motions of the agents encountered in social scenarios. In contrast to prior work, we reformulate the social motion problem as the refinement of predicted individual motions based on the surrounding agents, which simplifies training while still allowing single-motion forecasting when only one human is in the scene. We evaluate our model on multi-person and human-robot motion forecasting tasks and obtain state-of-the-art performance by a large margin while remaining efficient and running in real time. Additionally, our qualitative results showcase the effectiveness of our approach in generating human-robot interaction behaviors that can be controlled via text commands. Webpage: https://evm7.github.io/ECHO/
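The two-stage formulation described above (forecast each agent individually, then refine each forecast given the surrounding agents) can be illustrated with a toy sketch. This is not the ECHO architecture: the constant-velocity forecaster and the distance-based repulsion stand in for the learned transformer components, and all function names (`forecast_individual`, `refine_social`, `forecast_scene`) and the `min_dist` parameter are hypothetical, introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def forecast_individual(past: Sequence[Point], horizon: int) -> Trajectory:
    """Stage 1 stand-in: constant-velocity extrapolation for a single agent.

    Assumption: a learned single-agent forecaster would replace this.
    """
    (x0, y0), (x1, y1) = past[-2], past[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * t, y1 + vy * t) for t in range(1, horizon + 1)]


def refine_social(preds: List[Trajectory], min_dist: float = 1.0) -> List[Trajectory]:
    """Stage 2 stand-in: adjust each individual forecast using the other agents.

    Assumption: a simple repulsion that pushes agents apart when their
    forecasts come closer than min_dist; a learned refinement module
    would replace this rule.
    """
    if len(preds) < 2:
        # Only one agent in the scene: the individual forecast is returned as-is.
        return [list(p) for p in preds]
    refined = []
    for i, traj in enumerate(preds):
        new_traj = []
        for t, (x, y) in enumerate(traj):
            shift_x, shift_y = 0.0, 0.0
            for j, other in enumerate(preds):
                if j == i:
                    continue
                dx, dy = x - other[t][0], y - other[t][1]
                dist = (dx * dx + dy * dy) ** 0.5
                if 0.0 < dist < min_dist:
                    # Each agent takes half of the correction.
                    push = (min_dist - dist) / 2.0
                    shift_x += dx / dist * push
                    shift_y += dy / dist * push
            new_traj.append((x + shift_x, y + shift_y))
        refined.append(new_traj)
    return refined


def forecast_scene(pasts: List[Sequence[Point]], horizon: int) -> List[Trajectory]:
    """Forecast every agent individually, then refine the forecasts jointly."""
    individual = [forecast_individual(p, horizon) for p in pasts]
    return refine_social(individual)
```

With a single observed agent, `forecast_scene` reduces to the individual forecast, mirroring the single-motion case mentioned above; with several agents, only the refinement stage depends on the others, so the individual stage can be developed and tested in isolation.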