We introduce HandMeThat, a benchmark for the holistic evaluation of instruction understanding and following in physical and social environments. While previous datasets have primarily focused on language grounding and planning, HandMeThat considers the resolution of ambiguous human instructions based on physical (object states and relations) and social (human actions and goals) information. HandMeThat contains 10,000 episodes of human-robot interactions. In each episode, the robot first observes a trajectory of human actions directed toward the human's internal goal. Next, the robot receives a human instruction and must take actions to accomplish the subgoal specified by the instruction. In this paper, we present a textual interface for our benchmark, in which the robot interacts with a virtual environment through textual commands. We evaluate several baseline models on HandMeThat and show that both offline and online reinforcement learning algorithms perform poorly on it, suggesting significant room for future work on physical and social human-robot communication and interaction.