The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, together with a scoring technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings show that Finder outperforms existing methods based on deep reinforcement learning and VLMs. Ablation studies further validate our design choices, and scalability studies confirm robustness as the number of target objects increases. Website: https://find-all-my-things.github.io/
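To make the multi-channel score map idea concrete, the sketch below illustrates one plausible reading of it: one 2D score channel per target object, updated by blending a scene-level similarity with per-cell object-level similarities. The fusion rule, the `alpha` weighting, and the function name `update_score_maps` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a multi-channel score map, one channel per target object.
# The blend of scene- and object-level correlations is an assumed formulation.
import numpy as np

def update_score_maps(score_maps, cell_indices, scene_sims, object_sims, alpha=0.5):
    """Blend scene- and object-level semantic similarities into per-object channels.

    score_maps:   (num_objects, H, W) array, one score channel per target object
    cell_indices: list of (row, col) grid cells observed in the current view
    scene_sims:   (num_objects,) similarity of each target to the current scene
    object_sims:  (num_objects, num_cells) similarity of each target to detected objects
    alpha:        assumed weighting between the two correlation levels
    """
    for k in range(score_maps.shape[0]):
        for c, (r, col) in enumerate(cell_indices):
            fused = alpha * scene_sims[k] + (1 - alpha) * object_sims[k, c]
            # Keep the strongest evidence seen so far for each cell.
            score_maps[k, r, col] = max(score_maps[k, r, col], fused)
    return score_maps

# Example: two target objects on a 10x10 grid, three observed cells.
maps = np.zeros((2, 10, 10))
cells = [(2, 3), (2, 4), (3, 3)]
scene = np.array([0.7, 0.2])   # e.g., the current scene correlates with the first target
obj = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.1]])
maps = update_score_maps(maps, cells, scene, obj)
```

In this reading, the navigation policy would pick the next waypoint from the cells with the highest fused scores across channels, trading off expected detection likelihood against travel cost.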