Self-supervised feature learning enables perception systems to benefit from the vast amount of raw data recorded by vehicle fleets all over the world. However, the potential of self-supervised methods to learn dense representations from sequential data remains relatively unexplored. In this work, we propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks. We embed each frame as an unordered set of proposal feature vectors, a representation that is natural for instance-level perception architectures, and formulate sequential ordering prediction as a comparison of similarities between sets of feature vectors in a transformer-based multi-frame architecture. Extensive evaluation in automated driving domains on the BDD100K and MOT17 datasets shows that TempO outperforms existing self-supervised single-frame pre-training methods as well as supervised transfer learning initialization strategies on standard object detection and multi-object tracking benchmarks.
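The idea of ordering frames by comparing unordered sets of proposal feature vectors can be illustrated with a minimal sketch. This is not the TempO architecture, which is described above as a learned, transformer-based multi-frame model; the sketch instead assumes a fixed, hand-written set similarity (mean best-match cosine similarity) and a greedy ordering rule. The function names `set_similarity`, `greedy_order`, and `toy_frame` are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def set_similarity(frame_a, frame_b):
    """Permutation-invariant similarity between two sets of proposal vectors.

    Assumption for illustration: each proposal in frame_a is matched to its
    most similar proposal in frame_b, and the best-match scores are averaged.
    """
    return sum(max(cosine(u, v) for v in frame_b) for u in frame_a) / len(frame_a)


def greedy_order(first_frame, shuffled_frames):
    """Order shuffled frames by repeatedly picking the remaining frame that is
    most similar to the current one. Returns indices into shuffled_frames."""
    remaining = list(range(len(shuffled_frames)))
    order, current = [], first_frame
    while remaining:
        best = max(remaining, key=lambda i: set_similarity(current, shuffled_frames[i]))
        order.append(best)
        remaining.remove(best)
        current = shuffled_frames[best]
    return order


def toy_frame(t):
    """Hypothetical toy data: two 2-D 'proposal features' that rotate by
    10 degrees per time step, standing in for slowly changing appearance."""
    return [
        [math.cos(math.radians(a + 10 * t)), math.sin(math.radians(a + 10 * t))]
        for a in (0, 200)
    ]


if __name__ == "__main__":
    # Frames 3, 1, 2 are presented out of order; frame 0 is the known start.
    shuffled = [toy_frame(3), toy_frame(1), toy_frame(2)]
    print(greedy_order(toy_frame(0), shuffled))
```

Because the similarity is computed over sets, reordering the proposals within a frame does not change the result, which mirrors the unordered-set embedding described in the abstract. In a pre-training setting the similarity would be learned rather than fixed, with the correct temporal order serving as the free supervisory signal.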