Scene flow estimation provides fine-grained focus in the spatial domain, while 3D object tracking provides coherence in the temporal domain; the two are complementary. Building on this complementarity, this study addresses a comprehensive new task that captures fine-grained, long-term 3D motion in an online manner: long-term scene flow estimation (LSFE). We introduce SceneTracker, a novel learning-based LSFE network that iteratively refines an estimate of the optimal trajectory. In addition, it dynamically indexes and constructs appearance and depth correlation features simultaneously, and it employs a Transformer to explore and exploit long-range connections within and between trajectories. Detailed experiments show that SceneTracker handles 3D spatial occlusion and depth-noise interference effectively, making it well suited to the demands of the LSFE task. Finally, we build the first real-world evaluation dataset, LSFDriving, which further substantiates SceneTracker's strong generalization capacity. The code and data for SceneTracker are available at https://github.com/wwsource/SceneTracker.
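To make the abstract's description concrete, the sketch below illustrates the general idea of iterative trajectory refinement with dynamically indexed appearance/depth cues and a temporal Transformer. It is a minimal conceptual example, not the authors' implementation: the module names, feature dimensions, the depth-residual cue, and the use of `torch.nn.TransformerEncoder` are all illustrative assumptions.

```python
# Conceptual sketch of iterative 3D trajectory refinement (NOT the SceneTracker code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeTrajectoryRefiner(nn.Module):
    def __init__(self, feat_dim=128, num_iters=4):
        super().__init__()
        self.num_iters = num_iters
        d_model = feat_dim + 1 + 3  # appearance + depth cue + current (x, y, z)
        # Transformer mixes information along each trajectory (temporal axis).
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Head regresses a residual (dx, dy, dz) update per point per frame.
        self.update_head = nn.Linear(d_model, 3)

    def sample(self, maps, xy, size):
        # Bilinearly sample per-frame maps at the current 2D track positions.
        # maps: (T, C, H, W), xy: (T, N, 2) in pixels, size: (W, H).
        grid = 2.0 * xy / xy.new_tensor(size) - 1.0          # normalize to [-1, 1]
        grid = grid.unsqueeze(2)                              # (T, N, 1, 2)
        out = F.grid_sample(maps, grid, align_corners=True)   # (T, C, N, 1)
        return out.squeeze(-1).permute(0, 2, 1)               # (T, N, C)

    def forward(self, feat_maps, depth_maps, init_traj):
        # feat_maps:  (T, C, H, W) appearance features per frame
        # depth_maps: (T, 1, H, W) depth per frame
        # init_traj:  (T, N, 3) initial (x, y, z) trajectory per query point
        T, _, H, W = feat_maps.shape
        traj = init_traj
        for _ in range(self.num_iters):
            # Dynamically (re-)index appearance and depth cues at the current estimate.
            app = self.sample(feat_maps, traj[..., :2], (W, H))     # (T, N, C)
            dep = self.sample(depth_maps, traj[..., :2], (W, H))    # (T, N, 1)
            tokens = torch.cat([app, dep - traj[..., 2:], traj], dim=-1)
            # Attend along time: one sequence of T tokens per tracked point.
            mixed = self.transformer(tokens.permute(1, 0, 2))        # (N, T, D)
            delta = self.update_head(mixed).permute(1, 0, 2)         # (T, N, 3)
            traj = traj + delta                                      # refine estimate
        return traj
```

In this toy form, each iteration re-samples features at the latest trajectory estimate before predicting a residual update, which mirrors the "iterative approach to approximate the optimal trajectory" described above; the actual network's correlation construction and cross-trajectory attention are more elaborate.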