3D multi-object tracking (MOT) is vital for many applications, including autonomous vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods use only the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover objects missed by the detector. In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D MOT framework. To recover objects missed by the detector, we generate multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes propagate each object's historical trajectory information to the current frame, so the network can tolerate short-term missed detections of the tracked objects. We combine long-term object motion features and short-term object appearance features to create a per-hypothesis feature embedding, which reduces the computational overhead of spatial-temporal encoding. Additionally, we introduce a Global-Local Interaction Module that conducts information interaction among all hypotheses and models their spatial relations, leading to accurate estimation of hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks.
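The hybrid candidate boxes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the constant-velocity extrapolation stands in for whatever motion prediction TrajectoryFormer actually uses, and the function names `predict_box` and `generate_hypotheses`, the 7-value box layout, and the `radius` gating threshold are all assumptions made for illustration.

```python
import numpy as np


def predict_box(history: np.ndarray) -> np.ndarray:
    """Extrapolate the next box from a trajectory's history.

    history: (T, 7) array of boxes [x, y, z, l, w, h, yaw], oldest first.
    Assumption: constant velocity between the last two frames.
    """
    predicted = history[-1].copy()
    if len(history) >= 2:
        predicted[:3] += history[-1, :3] - history[-2, :3]
    return predicted


def generate_hypotheses(history: np.ndarray,
                        detections: np.ndarray,
                        radius: float = 2.0) -> list:
    """Build one trajectory hypothesis per candidate box.

    Candidates are the temporally predicted box plus every current-frame
    detection whose BEV center lies within `radius` metres of the prediction
    (hypothetical gating rule). The predicted box is always included, so a
    hypothesis exists even when the detector misses the object.
    """
    predicted = predict_box(history)
    candidates = [predicted]
    for det in detections:
        if np.linalg.norm(det[:2] - predicted[:2]) <= radius:
            candidates.append(det)
    # Each hypothesis is the history extended by one candidate box.
    return [np.vstack([history, box[None, :]]) for box in candidates]
```

When `detections` is empty, the function still returns a single hypothesis built from the predicted box, which mirrors how the predicted boxes let the tracker tolerate short-term missed detections.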