Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. However, benchmarks in robot learning often test generalization through inference on tasks in unobserved environments. In an observed environment, locating an object reduces to choosing among all object proposals in the environment, which may number in the hundreds of thousands. Armed with this intuition, and using only a generic vision-language scoring model with minor modifications for 3D encoding and for operation in an embodied environment, we demonstrate an absolute performance gain of 9.84% on remote object grounding over state-of-the-art models on REVERIE and of 5.04% on FAO. When allowed to pre-explore an environment, we also exceed the previous state-of-the-art pre-exploration method on REVERIE. Additionally, we demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach. Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.
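As a rough illustration of the reduction described above (grounding as selection among stored object proposals), the following minimal sketch ranks proposals with a pluggable scorer. Everything in it is an assumption made for illustration: the `Proposal` record, the `dot_scorer` function (a toy dot product standing in for a vision-language scoring model), and the choice to append 3D coordinates to the feature vector are hypothetical and are not the model or encoding used in this work.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Proposal:
    """Hypothetical record for one object proposal stored in the robot's map."""
    object_id: str
    feature: Tuple[float, ...]        # stand-in for a visual embedding
    xyz: Tuple[float, float, float]   # 3D position in the map frame


Scorer = Callable[[Tuple[float, ...], Proposal], float]


def dot_scorer(query: Tuple[float, ...], proposal: Proposal) -> float:
    """Toy stand-in for a vision-language scorer (assumption, not the paper's model).

    The proposal is encoded as its feature vector concatenated with its 3D
    coordinates, and the score is a dot product with the query vector.
    """
    encoded = proposal.feature + proposal.xyz
    if len(query) != len(encoded):
        raise ValueError("query and encoded proposal must have the same length")
    return sum(q * e for q, e in zip(query, encoded))


def ground(
    query: Tuple[float, ...],
    proposals: Iterable[Proposal],
    scorer: Scorer = dot_scorer,
    top_k: int = 1,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, object_id) pairs, best first.

    Proposals are consumed lazily, so only top_k candidates are held at once
    even when the map contains a very large number of proposals.
    """
    scored = ((scorer(query, p), p.object_id) for p in proposals)
    return heapq.nlargest(top_k, scored)


# Usage with made-up data: three proposals and a query vector that weights
# the first feature dimension and the x coordinate.
proposals = [
    Proposal("mug_kitchen", (0.9, 0.1), (1.0, 0.0, 0.8)),
    Proposal("mug_office", (0.9, 0.1), (6.0, 2.0, 0.7)),
    Proposal("pillow_bedroom", (0.1, 0.9), (3.0, 5.0, 0.5)),
]
query = (1.0, 0.0, 0.2, 0.0, 0.0)
best = ground(query, proposals)
```

Passing the scorer as a parameter keeps the selection loop independent of any particular model, so a learned scorer could in principle be substituted without changing the ranking code.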