Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. Although critical for downstream tasks such as manipulation, existing approaches to active perception have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators in indoor environments. Given image-based scene context, I-Perceive predicts camera views that follow open-ended language instructions. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated, scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both the prediction accuracy and the instruction following of generated camera views, and that it exhibits strong zero-shot generalization to novel scenes and tasks.