When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety-critical situations, such as sharp objects left on the floor. In this context, we present M-CoDAL, a multimodal dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train the system, we introduce a novel clustering-based active learning mechanism that uses an external Large Language Model (LLM) to identify informative instances. We evaluate our approach on a newly created multimodal dataset of 1K safety violations extracted from 2K Reddit images; the violations are annotated with a Large Multimodal Model (LMM) and verified by human annotators. Results on this dataset demonstrate that our approach improves the resolution of safety situations, user sentiment, and the safety of the conversation. We then deploy the dialogue system on a Hello Robot Stretch robot and conduct a within-subjects user study with real-world participants. In the study, participants role-play two safety scenarios of differing severity with the robot and receive interventions from our model and from a baseline system powered by OpenAI's ChatGPT. The study results corroborate and extend the findings of the automated evaluation, showing that our proposed system is more persuasive and competent in a real-world embodied-agent setting.