Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (4DSIS), which jointly segments, identifies, and temporally associates object instances. This setting challenges existing 3D semantic instance segmentation (3DSIS) methods, which lack temporal reasoning and therefore require a separate discrete matching step, as well as 4D LiDAR approaches, which perform poorly because they rely on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures to 4DSIS without requiring dense observations. It explores strategies for sharing information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate the task, we define a new metric, t-mAP, which extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
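To make the idea of rewarding temporal identity consistency concrete, the following is a minimal toy sketch, not the paper's actual t-mAP definition: it counts a predicted track as a true positive only if it overlaps the *same* ground-truth instance above an IoU threshold at every timestep it is predicted, so an identity switch between scans turns a match into a miss. All names (`is_temporal_tp`, the dict-based track representation) are illustrative assumptions.

```python
def iou(a, b):
    """Point-set IoU between two collections of point indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_temporal_tp(pred_track, gt_tracks, thresh=0.5):
    """Toy temporally consistent true-positive test (illustrative, not t-mAP).

    pred_track: {timestep: point_ids} for one predicted instance.
    gt_tracks:  {gt_instance_id: {timestep: point_ids}}.
    Returns True only if a single ground-truth identity matches the
    prediction above `thresh` at every predicted timestep, so identity
    switches across scans are penalized rather than rewarded.
    """
    for gt_track in gt_tracks.values():
        if all(t in gt_track and iou(mask, gt_track[t]) >= thresh
               for t, mask in pred_track.items()):
            return True
    return False
```

In a per-frame 3DSIS metric, a prediction that tracks one chair at time 0 and a different chair at time 1 could score two true positives; under a consistency-aware criterion like the one sketched above, it scores none.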