Multiple Object Tracking (MOT) is crucial to autonomous vehicle perception. End-to-end transformer-based algorithms, which detect and track objects simultaneously, show great potential for the MOT task. However, most existing methods focus on image-based tracking of a single object category. In this paper, we propose MotionTrack, an end-to-end transformer-based MOT algorithm that takes multi-modality sensor inputs and tracks objects of multiple classes. Our objective is to establish a transformer baseline for MOT in autonomous driving environments. The proposed algorithm consists of a transformer-based data association (DA) module and a transformer-based query enhancement module, which together perform MOT and Multiple Object Detection (MOD) simultaneously. MotionTrack and its variants achieve better results (an AMOTA score of 0.55) on the nuScenes dataset than classical baseline models such as AB3DMOT, CenterTrack, and the probabilistic 3D Kalman filter. In addition, we demonstrate that a modified attention mechanism can be used for DA to accomplish MOT, and that it can aggregate history features to enhance MOD performance.
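To make the idea of attention-based data association concrete, the following is a minimal sketch and not the MotionTrack implementation. The abstract does not specify how the attention mechanism is modified, so the sketch assumes plain scaled dot-product attention between track queries and detection features, followed by a greedy one-to-one assignment. The function names (`attention_affinity`, `associate`) and the `min_weight` threshold are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_affinity(track_queries, detection_keys):
    """Scaled dot-product attention weights, one row per track query.

    Assumption: each track and each detection is a feature vector of the
    same length. Row t holds the attention of track t over all detections.
    """
    dim = len(track_queries[0])
    scale = 1.0 / math.sqrt(dim)
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in detection_keys])
        for q in track_queries
    ]


def associate(track_queries, detection_keys, min_weight=0.5):
    """Greedy one-to-one matching on the attention weights (an assumption).

    Returns (matches, unmatched), where matches maps track index to
    detection index and unmatched lists detections left without a track.
    """
    if not track_queries or not detection_keys:
        return {}, list(range(len(detection_keys)))
    weights = attention_affinity(track_queries, detection_keys)
    # Visit (weight, track, detection) triples from strongest to weakest.
    candidates = sorted(
        ((w, t, d) for t, row in enumerate(weights) for d, w in enumerate(row)),
        reverse=True,
    )
    matches, used = {}, set()
    for w, t, d in candidates:
        if w < min_weight:
            break  # all remaining pairs are weaker than the threshold
        if t in matches or d in used:
            continue
        matches[t] = d
        used.add(d)
    unmatched = [d for d in range(len(detection_keys)) if d not in used]
    return matches, unmatched


# Toy example: two existing tracks, three detections in the new frame.
tracks = [[4.0, 0.0], [0.0, 4.0]]
detections = [[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]]
matches, unmatched = associate(tracks, detections)
```

In this toy example each track attends most strongly to the detection whose feature vector points in the same direction, and the third detection is left unmatched, which a tracker would typically treat as a candidate new object. A learned model would replace the raw feature vectors with projected queries and keys, and the greedy step could be swapped for an optimal assignment such as the Hungarian algorithm.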