Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using the LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt to the observed scene. To overcome these limitations, we propose IPFormer, the first method that leverages context-adaptive instance proposals at both train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and delivers a runtime reduction exceeding 14x. These results establish our introduction of context-adaptive instance proposals as a pioneering effort in vision-based 3D Panoptic Scene Completion. Code is available at https://github.com/markus-42/ipformer.
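The contrast the abstract draws between static learned queries and context-adaptive initialization can be illustrated with a minimal PyTorch sketch. This is a hypothetical simplification, not the authors' IPFormer implementation: here, query seeds are derived by adaptively pooling the image feature map (a naive stand-in for panoptic instance proposals) and are then refined by cross-attention over image tokens, so the queries depend on the observed scene at test time as well as during training.

```python
import torch
import torch.nn as nn


class ContextAdaptiveQueries(nn.Module):
    """Sketch: initialize instance queries from image context rather than a
    fixed learned set, then refine them via attention-based encoding.
    Hypothetical stand-in for IPFormer's instance-proposal initialization."""

    def __init__(self, num_queries: int = 16, dim: int = 64, heads: int = 4):
        super().__init__()
        side = int(num_queries ** 0.5)
        assert side * side == num_queries, "num_queries must be a square here"
        # Pool backbone features to a side x side grid: one seed per region.
        self.pool = nn.AdaptiveAvgPool2d(side)
        self.proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features from some backbone (assumed given)
        B, C, H, W = feats.shape
        seeds = self.pool(feats).flatten(2).transpose(1, 2)  # (B, Q, C)
        queries = self.proj(seeds)                           # context-adaptive init
        tokens = feats.flatten(2).transpose(1, 2)            # (B, H*W, C)
        # Refine the proposals by attending over the full image context.
        refined, _ = self.cross_attn(queries, tokens, tokens)
        return self.norm(queries + refined)                  # (B, Q, C)


# Usage: queries change with the input features, unlike a fixed nn.Embedding.
model = ContextAdaptiveQueries(num_queries=16, dim=64)
out = model(torch.randn(2, 64, 32, 32))
```

In a full PSC pipeline these refined queries would be decoded against a 3D voxel volume to produce per-instance masks and semantics; the sketch stops at query refinement, which is the part the abstract's train/test-time adaptivity claim concerns.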