In natural human-to-human communication, multimodal input typically supplements explicit voice commands and complements implicit ones, and its casual nature allows flexible combinations of input modalities and tolerates imprecise input data. For example, saying "I want that." with a casual glance at a bottle of water is clear enough between humans: it is an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one. To enable such human-like interaction in human-robot interaction (HRI), we propose IntenBot, a system that understands user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR. The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation. Flexible and imprecise multimodal input enables casual, human-like interaction with robots, reduces the time, effort, and attention required, and can also serve as non-voice input. We conducted an informative user behavior study in a simulated environment to understand users' natural behavior when flexibly interacting with a robot using multimodal input and to obtain appropriate angle range parameters for gaze and finger-pointing. We then performed an XR study to evaluate the performance of IntenBot against other methods. We also deployed IntenBot on a physical robot to showcase its real-world applications.
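To illustrate what an angle range parameter for gaze or finger-pointing could mean in practice, the following is a minimal sketch, not IntenBot's actual implementation. It assumes each modality yields a ray (an origin and a direction) and that scene objects have known 3D positions; objects whose angular offset from the ray falls within a threshold are kept as candidates, which could then be passed to the LLM together with the utterance. The function names, the example scene, and the 15-degree threshold are all hypothetical and are not values reported by the study.

```python
import math

def angle_deg(direction, origin, target):
    """Angle in degrees between a ray direction and the vector origin -> target."""
    to_target = [t - o for t, o in zip(target, origin)]
    dot = sum(d * v for d, v in zip(direction, to_target))
    norm = math.hypot(*direction) * math.hypot(*to_target)
    if norm == 0.0:
        raise ValueError("zero-length direction, or target located at the origin")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def candidates_in_range(origin, direction, objects, max_angle_deg):
    """Return (name, angle) pairs within max_angle_deg of the ray, nearest first."""
    scored = [(name, angle_deg(direction, origin, pos)) for name, pos in objects.items()]
    return sorted((hit for hit in scored if hit[1] <= max_angle_deg), key=lambda hit: hit[1])

# Hypothetical scene: positions in metres, user at the origin looking along +z.
scene = {
    "bottle": (0.1, 0.0, 1.0),
    "cup": (1.0, 0.0, 1.0),
    "book": (0.0, 0.0, -1.0),
}
gaze_candidates = candidates_in_range((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene, 15.0)
```

A wider threshold tolerates a more casual glance but returns more candidates to disambiguate, which is the trade-off the angle range parameters have to balance.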