We present the first approach for building hierarchical, task-driven 3D scene graphs of arbitrary indoor or outdoor environments in real time using an uncalibrated monocular camera. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), and we observe that traversability information (the "places" layer of a scene graph) can be reconstructed directly by adding an extra head to existing geometric foundation models, such as VGGT. Our approach is task-driven in the sense that it adjusts the granularity of the objects and regions in the map to the task at hand: during a manipulation task it can resolve the small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). In a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally accommodates complex loco-manipulation tasks, in which the robot dynamically adjusts its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach for querying information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate that FOUND-IT runs in real time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we construct 3D scene graphs from casually captured realtor apartment tours on YouTube. Code will be made available upon publication.