ADA-Track++: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association

Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, using track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking in a single query embedding that must serve both tasks, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm: they detect objects using decoupled track and detection queries and then associate them in a separate step. These methods, however, do not leverage synergies between the detection and association tasks. Combining the strengths of both paradigms, we introduce ADA-Track++, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention that leverages both appearance and geometric features. We also propose an auxiliary token in this attention-based association module, which helps mitigate the disproportionately high attention that incorrect association targets receive as a result of attention normalization. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, so that each layer performs both DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association tasks alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate its advantage over the two previous paradigms.
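
The normalization issue that motivates the auxiliary token can be illustrated with a small, hypothetical sketch. It is not the ADA-Track++ implementation: the function names, the example scores, and the fixed auxiliary score of 0.0 are all assumptions made for illustration (the abstract does not say how the token or its score is computed). The sketch only shows that a softmax over track queries must sum to one, so a detection that matches no track still puts most of its attention on some track, and that an extra slot in the normalization can absorb that attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def association_weights(scores, aux_score=None):
    """Attention of one detection query over the existing track queries.

    If aux_score is given, one extra (hypothetical) auxiliary slot is
    appended before normalization and dropped afterwards, so the weights
    over real tracks may sum to less than one.
    """
    if aux_score is None:
        return softmax(scores)
    weights = softmax(scores + [aux_score])
    return weights[:-1]  # keep only the weights on real tracks


# Assumed scores for a newly appeared object: it matches no track well.
scores = [-4.0, -5.0, -4.5]

plain = association_weights(scores)
with_aux = association_weights(scores, aux_score=0.0)  # 0.0 is an assumption

print("without auxiliary slot:", [round(w, 3) for w in plain])
print("with auxiliary slot:   ", [round(w, 3) for w in with_aux])
```

Without the auxiliary slot, the least-bad track receives more than half of the attention even though every score is low. With it, the weights on all real tracks stay small, which is the kind of behavior the auxiliary token is described as encouraging.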
