To follow open-ended instructions like "go open the brown cabinet over the sink", robots need to understand both the geometry and the semantics of their environment. Robotic systems often handle these through separate pipelines, sometimes using very different representation spaces, which can be suboptimal when the two objectives conflict. In this work, we present "method", a simple method for constructing a world representation that encodes both the semantics and spatial affordances of a scene in a differentiable map. This allows us to build a gradient-based planner which can navigate to locations in the scene specified using open-ended vocabulary. We use this planner to consistently generate trajectories which are both 5-10% shorter and 10-30% closer to our goal query in CLIP embedding space than paths from comparable grid-based planners which don't leverage gradient information. To our knowledge, this is the first end-to-end differentiable planner that optimizes for both semantics and affordance in a single implicit map. Code and visuals are available at our website: https://usa.bolte.cc/
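To make the idea of gradient-based planning over a differentiable map concrete, the following is a minimal toy sketch and not the implementation described above. It assumes a 2D world in which the learned implicit map is replaced by two hand-written analytic fields: a Gaussian bump standing in for low affordance near an obstacle, and a squared distance to a fixed point standing in for dissimilarity to the goal query in CLIP embedding space. All names, weights, and constants (`OBSTACLE`, `GOAL`, `cost_and_grad`, `plan`, and so on) are hypothetical.

```python
import math

# Toy stand-ins, all assumptions made for illustration only.
OBSTACLE = (0.5, 0.5)    # centre of a Gaussian "hard to traverse" bump
SIGMA = 0.15             # width of the bump
GOAL = (1.0, 1.0)        # point where the toy semantic score peaks
W_OBSTACLE = 0.5         # weight of the affordance term
W_SEMANTIC = 1.0         # weight of the semantic term


def cost_and_grad(path):
    """Return (cost, gradient) for a path given as a list of [x, y] waypoints."""
    cost = 0.0
    grad = [[0.0, 0.0] for _ in path]

    # Path-length term: sum of squared segment lengths.
    for i in range(len(path) - 1):
        dx = path[i + 1][0] - path[i][0]
        dy = path[i + 1][1] - path[i][1]
        cost += dx * dx + dy * dy
        grad[i + 1][0] += 2 * dx
        grad[i + 1][1] += 2 * dy
        grad[i][0] -= 2 * dx
        grad[i][1] -= 2 * dy

    # Affordance term: smooth penalty for passing near the obstacle.
    for i, (x, y) in enumerate(path):
        ox, oy = x - OBSTACLE[0], y - OBSTACLE[1]
        bump = W_OBSTACLE * math.exp(-(ox * ox + oy * oy) / (2 * SIGMA ** 2))
        cost += bump
        grad[i][0] -= bump * ox / SIGMA ** 2
        grad[i][1] -= bump * oy / SIGMA ** 2

    # Semantic term: pull the final waypoint toward the goal.
    ex, ey = path[-1][0] - GOAL[0], path[-1][1] - GOAL[1]
    cost += W_SEMANTIC * (ex * ex + ey * ey)
    grad[-1][0] += 2 * W_SEMANTIC * ex
    grad[-1][1] += 2 * W_SEMANTIC * ey
    return cost, grad


def plan(path, steps=2000, lr=0.02):
    """Plain gradient descent on every waypoint except the fixed start."""
    path = [list(p) for p in path]      # copy so the caller's path is untouched
    for _ in range(steps):
        _, grad = cost_and_grad(path)
        for i in range(1, len(path)):   # waypoint 0 is the robot's current pose
            path[i][0] -= lr * grad[i][0]
            path[i][1] -= lr * grad[i][1]
    return path


initial = [[i / 7, 0.0] for i in range(8)]   # straight line along the x-axis
optimized = plan(initial)
```

Because all three terms are summed into a single scalar, one gradient step trades path length, traversability, and goal similarity against each other at once, which is the property a grid-based planner without gradient information lacks. A real system would presumably obtain the gradients by automatic differentiation through the learned map rather than by hand-derived formulas as in this sketch.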