Recent advances in monocular 3D object detection facilitate 3D multi-object tracking (MOT) with low-cost camera sensors. We find that the motion cues of objects across time frames are critical for 3D MOT, yet they remain underexplored in existing monocular approaches. To exploit them, we propose MoMA-M3T, a motion-aware framework for monocular 3D MOT that consists mainly of three motion-aware components. First, we represent the possible movement of an object relative to all object tracklets in the feature space as its motion features. Second, we model the historical object tracklets across time frames from a spatio-temporal perspective via a motion transformer. Finally, we propose a motion-aware matching module that associates historical object tracklets with current observations to produce the final tracking results. Extensive experiments on the nuScenes and KITTI datasets demonstrate that MoMA-M3T achieves competitive performance against state-of-the-art methods. Moreover, the proposed tracker is flexible and can be easily plugged into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T.
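To make the association step concrete, the following is a minimal sketch of matching historical tracklets to current detections using relative motion. It is not the MoMA-M3T implementation: the learned motion features and the motion transformer are replaced here by a hand-crafted 3D offset and a greedy nearest-first assignment, and the names `relative_motion`, `greedy_match`, and the threshold `max_dist` are hypothetical choices made for illustration only.

```python
import math


def relative_motion(tracklet_pos, detection_pos):
    """Offset (dx, dy, dz) from a tracklet's last 3D center to a detection."""
    return tuple(d - t for t, d in zip(tracklet_pos, detection_pos))


def greedy_match(tracklets, detections, max_dist=2.0):
    """Greedily pair tracklets with detections by smallest offset norm.

    tracklets, detections: lists of (x, y, z) object centers.
    max_dist: hypothetical gating threshold (an assumption, not from the paper).
    Returns a list of (tracklet_index, detection_index) pairs.
    """
    candidates = []
    for ti, t in enumerate(tracklets):
        for di, d in enumerate(detections):
            offset = relative_motion(t, d)
            dist = math.sqrt(sum(c * c for c in offset))
            if dist <= max_dist:  # discard implausibly large motions
                candidates.append((dist, ti, di))

    candidates.sort()  # closest pairs are assigned first
    used_tracklets, used_detections, matches = set(), set(), []
    for _, ti, di in candidates:
        if ti not in used_tracklets and di not in used_detections:
            matches.append((ti, di))
            used_tracklets.add(ti)
            used_detections.add(di)
    return matches


# Two existing tracklets and three detections in the current frame.
tracklets = [(0.0, 0.0, 10.0), (5.0, 0.0, 20.0)]
detections = [(5.2, 0.0, 20.5), (0.1, 0.0, 10.4), (30.0, 0.0, 40.0)]
matches = greedy_match(tracklets, detections)
print(matches)
```

In this toy setup the third detection lies beyond the gating threshold of every tracklet, so it stays unmatched and would typically start a new tracklet. A learned approach such as the one described above would replace the Euclidean offset with feature-space affinities.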