Human intention-based systems enable robots to perceive and interpret user actions, allowing them to interact with humans and proactively adapt to their behavior. Intention prediction is therefore pivotal to natural interaction with social robots in human-designed environments. In this paper, we examine the use of Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates non-verbal user cues, such as hand gestures, body poses, and facial expressions, with environment states and verbal user cues to predict user intentions within a hierarchical architecture. Our evaluation of five LLMs shows their potential for reasoning about verbal and non-verbal user cues, leveraging their context understanding and real-world knowledge to support intention prediction during collaborative task execution with a social robot.