In this paper, we introduce Robi Butler, a novel household robotic system that enables multimodal interaction with remote users. Building on advanced communication interfaces, Robi Butler allows users to monitor the robot's status, send text or voice instructions, and select target objects by hand pointing. At the core of our system is a high-level behavior module, powered by Large Language Models (LLMs), that interprets multimodal instructions to generate action plans. These plans are composed of open-vocabulary primitives supported by Vision Language Models (VLMs) that handle both text and pointing queries. Together, these components allow Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We demonstrate the effectiveness and efficiency of the system on a variety of daily household tasks in which remote users give multimodal instructions. Additionally, we conduct a user study to analyze how multimodal interaction affects efficiency and user experience during remote human-robot interaction, and we discuss potential improvements.