In this work, we explore how multimodal large language models can support real-time, context- and value-aware decision-making. To do so, we combine the GPT-4o language model with a TurtleBot 4 platform that simulates a smart vacuum cleaning robot in a home. The model evaluates the environment through vision input and determines whether it is appropriate to initiate cleaning. The system illustrates the ability of such models to reason about domestic activities, social norms, and user preferences, and to make nuanced decisions aligned with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing that it can infer context and values from limited visual input. Our results point to the promise of multimodal large language models for enhancing robotic autonomy and situational awareness, while also underscoring challenges related to consistency, bias, and real-time performance.