IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using the LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based approaches use a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, which limits their ability to adapt to the observed scene. To overcome these limitations, we propose IPFormer, the first method that leverages context-adaptive instance proposals at both training and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and delivers a runtime reduction exceeding 14x. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.
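The contrast between static learned queries and context-adaptive proposals can be illustrated with a toy sketch. This is not the IPFormer implementation: the function names (`static_queries`, `adaptive_queries`, `cross_attention`), the feature dimensions, and the use of a single dot-product attention step as the initialization rule are all assumptions made purely for illustration. The sketch only shows that a fixed query set starts from the same values for every scene at test time, whereas queries initialized from image context differ from scene to scene.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head dot-product attention (illustrative only).

    Each output row is a convex combination of the context rows,
    weighted by the similarity between the query and each context row.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, c)) / math.sqrt(dim)
            for c in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def static_queries(learned, context):
    """Fixed learned queries: the initial values ignore the observed scene."""
    return [row[:] for row in learned]


def adaptive_queries(learned, context):
    """Hypothetical context-adaptive initialization (an assumption, not the
    paper's method): seed queries attend to the image features so that the
    initial proposals depend on the observed scene."""
    return cross_attention(learned, context)


# Toy data: 3 seed queries and two "scenes" with 5 image feature vectors each.
rng = random.Random(0)
DIM = 4
learned = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
scene_a = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
scene_b = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

static_a = static_queries(learned, scene_a)
static_b = static_queries(learned, scene_b)
adaptive_a = adaptive_queries(learned, scene_a)
adaptive_b = adaptive_queries(learned, scene_b)

print("static init identical across scenes:", static_a == static_b)
print("adaptive init identical across scenes:", adaptive_a == adaptive_b)
```

In a full model, either kind of query would then be refined by further attention-based encoding and decoding; the sketch covers only the initialization step that the abstract singles out.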
