Gestures serve as a fundamental and significant mode of non-verbal communication among humans. Deictic gestures (such as pointing towards an object), in particular, offer a valuable means of efficiently expressing intent in situations where language is inaccessible, restricted, or highly specialized. As a result, it is essential for robots to comprehend gestures in order to infer human intentions and establish more effective coordination with them. Prior work often relies on a rigid, hand-coded library of gestures along with their meanings. However, the interpretation of gestures is often context-dependent, requiring more flexibility and common-sense reasoning. In this work, we propose a framework, GIRAF, for more flexibly interpreting gesture and language instructions by leveraging the power of large language models. Our framework is able to accurately infer human intent and contextualize the meaning of their gestures for more effective human-robot collaboration. We instantiate the framework for interpreting deictic gestures in table-top manipulation tasks and demonstrate that it is both effective and preferred by users, achieving 70% higher success rates than the baseline. We further demonstrate GIRAF's ability to reason about diverse types of gestures by curating a GestureInstruct dataset consisting of 36 different task scenarios. GIRAF achieved an 81% success rate in finding the correct plan for tasks in GestureInstruct. Website: https://tinyurl.com/giraf23