In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal vision-language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations and the semantic understanding of the robot's task environment, and to abstract them into actionable robot commands or queries. We quantitatively evaluated our framework's understanding of natural vocal conversation with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal command decoding accuracy, an 86.27% command execution success rate, and an average latency of 0.89 seconds from receiving a participant's vocal chat command to initiating the robot's physical action. Video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/.
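To make the decoding step concrete, the following is a minimal Python sketch of the input/output contract it implies: a transcribed utterance is abstracted into a structured, actionable robot command. The command schema, keyword rules, and function names here are illustrative assumptions only, not the paper's implementation; in the actual framework this mapping is delegated to the pre-trained LLM/VLM models.

```python
# Minimal sketch (assumptions only, not the paper's implementation):
# abstracting a transcribed natural-language utterance into a structured
# robot command. The real framework uses pre-trained LLM/VLM models for
# this step; simple keyword rules stand in for them here.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobotCommand:
    action: str                    # e.g. "navigate", "describe_scene"
    target: Optional[str] = None   # e.g. a named location in the map


def decode_utterance(utterance: str) -> Optional[RobotCommand]:
    """Map a free-form utterance to an actionable command (illustrative)."""
    text = utterance.lower()
    if "go to" in text or "navigate" in text:
        # Everything after "to" is treated as the navigation goal.
        target = text.split("to", 1)[-1].strip()
        return RobotCommand(action="navigate", target=target)
    if "what do you see" in text or "describe" in text:
        return RobotCommand(action="describe_scene")
    return None  # the full framework would instead issue a clarifying query


if __name__ == "__main__":
    print(decode_utterance("Please go to the charging station"))
    # RobotCommand(action='navigate', target='the charging station')
```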