The advent of generalist Large Language Models (LLMs) and Vision-Language Models (VLMs) has streamlined the construction of semantically enriched maps in which robots can ground high-level reasoning and planning. One of the most widely used semantic map formats is the 3D Scene Graph, which captures both metric (low-level) and semantic (high-level) information. However, these maps typically assume a static world, whereas real environments, such as homes and offices, are dynamic, and even small changes in these spaces can significantly impact task performance. To operate reliably in dynamic environments, robots must detect changes and update the scene graph in real time. This update process is inherently multimodal, drawing on input from several sources: human agents, the robot's own perception system, the passage of time, and the robot's actions. This work proposes a framework that leverages these multimodal inputs to keep scene graphs consistent during real-time operation, presents promising initial results, and outlines a roadmap for future research.
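To make the core idea concrete, the sketch below shows a minimal, hypothetical 3D scene graph in which each node carries both metric (pose) and semantic (label) attributes, and change events arriving from different modalities (perception, human input, robot actions) are applied to keep the graph consistent. All class, field, and event names here are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A scene-graph node with semantic (label) and metric (pose) attributes."""
    name: str
    label: str            # semantic (high-level) information, e.g. "tableware"
    pose: tuple           # metric (low-level) information, e.g. (x, y, z)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)   # (parent, relation, child) triples

    def add(self, node, parent=None, relation=None):
        self.nodes[node.name] = node
        if parent is not None:
            self.edges.add((parent, relation, node.name))

    def apply_change(self, event):
        """Apply a change event from any modality: the robot's perception,
        a human utterance, the robot's own action, or elapsed time."""
        if event["type"] == "moved":      # e.g. perception re-observes the object
            self.nodes[event["name"]].pose = event["pose"]
        elif event["type"] == "removed":  # e.g. a human reports taking the object
            self.nodes.pop(event["name"], None)
            self.edges = {e for e in self.edges
                          if event["name"] not in (e[0], e[2])}

# Build a small graph, then apply two multimodal change events.
g = SceneGraph()
g.add(Node("table", "furniture", (1.0, 0.0, 0.0)))
g.add(Node("cup", "tableware", (1.0, 0.0, 0.8)), parent="table", relation="on")

g.apply_change({"type": "moved", "name": "cup", "pose": (2.0, 1.0, 0.8)})
g.apply_change({"type": "removed", "name": "cup"})
```

After both events, the graph no longer contains the cup node or its "on" edge, while the table is untouched; a real framework would additionally fuse conflicting observations and timestamp each update.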