In this work, we explore long-term temporal visual correspondence-based optimization for 3D video object detection. Visual correspondence refers to one-to-one mappings between pixels across multiple images. Correspondence-based optimization is the cornerstone of 3D scene reconstruction but has been less studied in 3D video object detection, because moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We address this issue by treating objects as first-class citizens during correspondence-based optimization. We propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det on multiple baseline 3D detectors under various setups. BA-Det achieves state-of-the-art (SOTA) performance on the large-scale Waymo Open Dataset (WOD) with only marginal additional computation cost. Our code is available at https://github.com/jiaweihe1996/BA-Det.