PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.
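The final step described above (one candidate transform per matched object, selected by geometric consensus) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how a transform is derived from a single match, so the sketch assumes gravity-aligned scenes (rotation about the vertical axis only), represents each object by its 3D centroid, and sweeps a discrete set of yaw angles per anchor match. The function names, `n_yaw`, and `inlier_tol` are hypothetical.

```python
import numpy as np


def yaw_rotation(theta):
    """Rotation by angle theta (radians) about the vertical z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def register_by_consensus(src, dst, n_yaw=72, inlier_tol=0.25):
    """Pick the rigid transform with the most agreeing object matches.

    src, dst: (N, 3) arrays of object centroids; row i of src is the
    proposed match of row i of dst (some matches may be wrong).
    Returns (R, t, inlier_count) such that dst is approximately R @ src + t.

    Assumption (not from the paper): both scans are gravity-aligned, so
    only yaw and translation are unknown, and yaw is found by a grid sweep.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    best_R, best_t = np.eye(3), np.zeros(3)
    best_key = (-1, -np.inf)  # (inlier count, negative inlier residual sum)

    for anchor in range(len(src)):
        for theta in np.linspace(0.0, 2.0 * np.pi, n_yaw, endpoint=False):
            R = yaw_rotation(theta)
            # Translation that maps the anchor object exactly onto its match.
            t = dst[anchor] - R @ src[anchor]
            residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
            mask = residuals < inlier_tol
            # Prefer more inliers; break ties by the tighter fit.
            key = (int(mask.sum()), -float(residuals[mask].sum()))
            if key > best_key:
                best_key, best_R, best_t = key, R, t

    return best_R, best_t, best_key[0]
```

Because every hypothesis is anchored on a single match, one correct match is enough to propose the right transform, and wrong matches are simply outvoted. In practice the winning hypothesis would likely be refined on its inliers (for example with a least-squares fit), which this sketch omits.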
