Large Language Models (LLMs) have substantially improved the conversational capabilities of social robots. Nevertheless, for intuitive and fluent human-robot interaction, robots should be able to ground the conversation by relating ambiguous or underspecified spoken utterances to the current physical situation and to the intents the user expresses nonverbally, for example through referential gaze. Here, we propose a representation that integrates speech and gaze to give LLMs higher situated awareness and let them correctly resolve ambiguous requests. Our approach relies on a text-based semantic translation of the user's scanpath, combined with the verbal request. With this representation, LLMs prove capable of reasoning about gaze behavior, robustly ignoring spurious glances and irrelevant objects. We validate the system across multiple tasks and two scenarios, where it shows greater generality and accuracy than the control conditions. Finally, we demonstrate an implementation on a robotic platform, closing the loop from request interpretation to execution.
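To make the idea of a text-based translation of the scanpath more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that gaze has already been mapped to labeled objects and that speech arrives as timestamped words. The names `Fixation`, `Word`, `merge_fixations`, and `describe_interaction`, as well as the output wording, are hypothetical choices made for illustration. The sketch deliberately does not filter out short glances, since the abstract states that the LLM itself is expected to ignore spurious ones.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    target: str      # label of the fixated object (hypothetical scene label)
    start: float     # onset time in seconds
    duration: float  # dwell time in seconds


@dataclass
class Word:
    text: str        # transcribed word
    start: float     # onset time in seconds


def merge_fixations(fixations):
    """Merge consecutive fixations on the same target into a single dwell."""
    merged = []
    for f in sorted(fixations, key=lambda f: f.start):
        if merged and merged[-1].target == f.target:
            last = merged[-1]
            end = max(last.start + last.duration, f.start + f.duration)
            merged[-1] = Fixation(last.target, last.start, end - last.start)
        else:
            merged.append(Fixation(f.target, f.start, f.duration))
    return merged


def describe_interaction(words, fixations):
    """Interleave speech and gaze events by time into plain text for an LLM."""
    events = [(w.start, 0, f'user says "{w.text}"') for w in words]
    events += [
        (f.start, 1, f"user looks at {f.target} for {f.duration:.1f} s")
        for f in merge_fixations(fixations)
    ]
    # Sort by time; at equal times, speech is listed before gaze.
    events.sort(key=lambda e: (e[0], e[1]))
    return "\n".join(f"[{t:.1f} s] {msg}" for t, _, msg in events)


if __name__ == "__main__":
    # Illustrative data only; not taken from the paper.
    words = [Word("give", 0.0), Word("me", 0.3), Word("that", 0.5)]
    gaze = [
        Fixation("red cup", 0.4, 0.3),
        Fixation("red cup", 0.7, 0.4),
        Fixation("window", 1.2, 0.1),
    ]
    print(describe_interaction(words, gaze))
```

The resulting text could be placed in an LLM prompt alongside a list of the objects in the scene, so that the model can relate an underspecified word such as "that" to what the user was looking at when it was spoken.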