Object search is a fundamental skill for household robots, and its core challenge is accurately localizing the target object. Household environments are dynamic: users place everyday objects arbitrarily, which makes target localization difficult. To locate a target object efficiently, the robot needs knowledge at both the object level and the room level. However, existing approaches rely on only one type of knowledge, leading to unsatisfactory object localization performance and, consequently, inefficient object search. To address this problem, we propose a commonsense scene graph-based target localization method, CSG-TL, to enhance target object search in household environments. Given a pre-built map containing stationary items, the robot combines room-level knowledge with object-level commonsense knowledge generated by a large language model (LLM) into a commonsense scene graph (CSG), enabling CSG-TL to exploit both types of knowledge. To demonstrate the superiority of CSG-TL for target localization, we perform extensive experiments on the real-world ScanNet dataset and in the AI2THOR simulator. Moreover, we extend CSG-TL to an object search framework, CSG-OS, which we validate in both simulated and real-world environments. Code and videos are available at https://sites.google.com/view/csg-os.