This paper introduces a novel approach to human-to-robot motion retargeting that enables robots to mimic human motion precisely while preserving its semantics. We propose a deep learning method that translates human motion directly into robot motion. The method does not require annotated paired human-robot motion data, which reduces the effort needed to adopt new robots. We first propose a cross-domain similarity metric for comparing poses from different domains (i.e., human and robot). We then construct a shared latent space via contrastive learning and decode latent representations into robot motion control commands. The learned latent space is expressive: it captures motions precisely and allows motion to be controlled directly in the latent space. We show how to generate in-between motion through simple linear interpolation in the latent space between two projected human poses. Additionally, we conduct a comprehensive evaluation of robot control using inputs from diverse modalities, such as text, RGB video, and key poses, which makes robot control easier for users of all backgrounds. Finally, we compare our model with existing work and demonstrate the effectiveness of our approach both quantitatively and qualitatively, with the aim of enabling more natural human-robot communication and fostering trust in integrating robots into daily life.
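The in-between motion generation described in the abstract can be illustrated with a minimal sketch of linear interpolation between two latent vectors. This is only an illustration under stated assumptions: the function name `interpolate_latents` and the list-of-floats representation are hypothetical, and the encoder that projects human poses into the latent space and the decoder that maps latents to robot commands are not shown.

```python
from typing import List, Sequence


def interpolate_latents(
    z_start: Sequence[float],
    z_end: Sequence[float],
    num_steps: int,
) -> List[List[float]]:
    """Return num_steps latent vectors evenly spaced from z_start to z_end.

    Hypothetical helper: z_start and z_end stand in for the latent
    projections of two human poses. The endpoints are included.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both endpoints")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")

    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # interpolation weight in [0, 1]
        # Element-wise convex combination of the two latent vectors.
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


if __name__ == "__main__":
    # Toy 2-D latents; real latent vectors would come from a trained encoder.
    for z in interpolate_latents([0.0, 2.0], [1.0, 0.0], num_steps=3):
        print(z)
```

In a full pipeline, each interpolated latent would be passed to the robot decoder to obtain one intermediate control command, so the number of steps sets the temporal resolution of the generated motion.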