Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safe driving, the tracker must reliably track objects across frames and accurately estimate their current states, such as velocity and acceleration. Existing works frequently focus on the association task while either neglecting model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in a scene while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through a long-term history of detections and is jointly optimized for both data association and state estimation tasks. Since standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks across the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.