Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to long-term changes with sparse observations. We address this gap with MoRE, a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as "living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances, whose accuracy and completeness increase over time. At the core of our method lies an SE(3)-equivariant representation in a single encoder-decoder network, trained on synthetic data. This representation enables us to seamlessly tackle instance matching, registration, and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art results both end to end and on the individual subtasks.
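To make the registration subtask concrete, the sketch below recovers the rigid SE(3) transform between two point clouds of the same object. It is not MoRE's method: it assumes noise-free points with known one-to-one correspondences and uses the classical Kabsch (SVD) solution, whereas the abstract describes a learned SE(3)-equivariant representation. The names `kabsch` and `random_rotation` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that target ~ source @ R.T + t.

    Assumes source and target are (N, 3) arrays with row i of one
    corresponding to row i of the other (an assumption of this sketch).
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


def random_rotation(rng):
    """Random 3x3 rotation matrix built from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # turn a reflection into a rotation
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan_a = rng.normal(size=(100, 3))        # object as seen in an earlier scan
    R_true = random_rotation(rng)
    t_true = np.array([0.5, -1.0, 2.0])
    scan_b = scan_a @ R_true.T + t_true       # same object after being moved
    R_est, t_est = kabsch(scan_a, scan_b)
    print("rotation recovered:", np.allclose(R_est, R_true))
    print("translation recovered:", np.allclose(t_est, t_true))
```

Once a transform like this is available for each instance, points from different scans can be mapped into a common object frame and accumulated, which is the intuition behind a reconstruction that becomes more complete over time.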