Autonomous robots often see a room only partially, through a doorway, where walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can supply this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure: estimating the likelihood that a target object lies in a hidden region and the probability that the region is occupied. Given an egocentric RGB observation and a target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.
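To make the two spatio-semantic queries concrete, the sketch below shows one way they could be answered once a set of labeled point cloud hypotheses has been sampled. It is an illustration under stated assumptions, not the paper's implementation: the floor-plane grid, the function names `cells_hit` and `estimate_priors`, the `cell_size` parameter, and the use of simple sample frequencies as probability estimates are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple

Point = Tuple[float, float, float, str]  # (x, y, z, semantic label)
Cell = Tuple[int, int]                   # index of a floor-plane grid cell


def cells_hit(points: Iterable[Point], cell_size: float,
              label: Optional[str] = None) -> Set[Cell]:
    """Grid cells containing at least one point (optionally of one label)."""
    return {
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y, _z, lab in points
        if label is None or lab == label
    }


def estimate_priors(hypotheses: List[List[Point]], target: str,
                    cell_size: float = 0.5) -> Dict[Cell, Tuple[float, float]]:
    """Map each cell to (occupancy probability, target probability).

    Assumption: both probabilities are estimated as the fraction of
    sampled hypotheses in which the cell is occupied / holds the target.
    """
    n = len(hypotheses)
    if n == 0:
        raise ValueError("need at least one hypothesis")
    occupied: Counter = Counter()
    has_target: Counter = Counter()
    for points in hypotheses:
        # Each hypothesis contributes at most one count per cell.
        occupied.update(cells_hit(points, cell_size))
        has_target.update(cells_hit(points, cell_size, label=target))
    return {cell: (count / n, has_target[cell] / n)
            for cell, count in occupied.items()}


# Two toy hypotheses of the hidden room (made-up coordinates and labels).
hypotheses = [
    [(0.1, 0.2, 0.0, "floor"), (1.2, 0.3, 0.4, "chair")],
    [(0.3, 0.1, 0.0, "floor"), (1.4, 0.2, 0.5, "table")],
]
priors = estimate_priors(hypotheses, target="chair", cell_size=1.0)
```

In this toy input, both cells are occupied in every hypothesis, while the chair appears in cell `(1, 0)` in only one of the two samples, so its estimated target probability there is 0.5. A planner could consume such a map directly, although a full 3D voxel grid or a smoothed estimator might be preferable in practice.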