3D multi-object tracking (MOT) is essential for an autonomous mobile agent to safely navigate a scene. To maximize the perception capabilities of the autonomous agent, we aim to develop a 3D MOT framework that fuses camera and LiDAR sensor information. Building on our prior LiDAR-only work, ShaSTA, which models shape and spatio-temporal affinities for 3D MOT, we propose a novel camera-LiDAR fusion approach for learning affinities. At its core, this work proposes a fusion technique that generates a rich sensory signal incorporating information about depth and distant objects. This signal enhances affinity estimation, which in turn improves data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence score refinement. Our main contributions are a novel fusion approach for combining camera and LiDAR sensory signals to learn affinities, and a first-of-its-kind multimodal sequential track confidence refinement technique that fuses 2D and 3D detections. Additionally, we perform an ablative analysis on each fusion step to demonstrate the added benefits of incorporating the camera sensor, particularly for small, distant objects that tend to suffer from the depth-sensing limits and sparsity of LiDAR sensors. In sum, our technique achieves state-of-the-art performance on the nuScenes benchmark among multimodal 3D MOT algorithms using CenterPoint detections.