In this work, we study self-supervised multiple object tracking (MOT) without using any video-level association labels. We cast MOT as learning the frame-wise associations between detections in consecutive frames. To this end, we propose a differentiable soft object assignment for object association, which makes it possible to learn features tailored to association through end-to-end training. Building on this training approach, we develop an appearance-based model that learns instance-aware object features, which are used to construct a cost matrix from the pairwise distances between them. We train our model on temporal and multi-view data, obtaining association pseudo-labels from optical flow and disparity information. Unlike most self-supervised tracking methods, which rely on pretext tasks to learn feature correspondences, our method is optimized directly for cross-object association in complex scenarios. As such, the proposed method offers a re-identification-based MOT approach that is robust to training hyperparameters and does not suffer from local minima, a common challenge for self-supervised methods. We evaluate our model on the KITTI, Waymo, nuScenes, and Argoverse datasets, where it consistently improves over other unsupervised methods ($7.8\%$ improvement in association accuracy on nuScenes).
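To make the idea of a differentiable soft assignment over a distance-based cost matrix concrete, the following is a minimal sketch and not the authors' implementation. It assumes Euclidean distances between object features, a row-wise softmax over negative costs as the soft assignment, and a negative log-likelihood loss against pseudo-labels; the abstract specifies none of these choices. The function names, the `temperature` parameter, and the toy features are all hypothetical.

```python
import math


def pairwise_cost(feats_a, feats_b):
    """Cost matrix of Euclidean distances between two sets of feature vectors."""
    return [[math.dist(a, b) for b in feats_b] for a in feats_a]


def soft_assignment(cost, temperature=0.1):
    """Row-wise softmax over negative costs (an assumed form of soft assignment)."""
    soft = []
    for row in cost:
        logits = [-c / temperature for c in row]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        soft.append([e / total for e in exps])
    return soft


def association_loss(soft, pseudo_labels):
    """Mean negative log-likelihood of the pseudo-labelled matches."""
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(soft[i][j] + eps) for i, j in enumerate(pseudo_labels)]
    return sum(losses) / len(losses)


# Toy example: two detections in frame t, two in frame t+1 (made-up features).
feats_t = [[1.0, 0.0], [0.0, 1.0]]
feats_t1 = [[0.1, 0.9], [0.9, 0.1]]
# Hypothetical pseudo-labels: detection i in frame t matches index
# pseudo_labels[i] in frame t+1 (in the paper these come from optical flow
# and disparity; here they are simply hard-coded).
pseudo_labels = [1, 0]

cost = pairwise_cost(feats_t, feats_t1)
soft = soft_assignment(cost)
loss = association_loss(soft, pseudo_labels)
```

Because every step is a smooth function of the features, the same operations written in an automatic-differentiation framework would let the loss gradient flow back into the feature extractor, which is the property the abstract relies on. A hard assignment, such as picking the minimum-cost match per row, would not provide such a gradient.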