Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision-language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that uses the predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. Extensive evaluations demonstrate that ProReFF captures meaningful relative feature distributions in natural scenes, and they provide insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing it with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.