Existing deep trackers are typically trained on large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, especially for large-scale datasets. In this paper, we propose to learn tracking representations from single-point annotations (i.e., 4.5x faster to annotate than traditional bounding boxes) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. SoCL consists of adaptive positive- and negative-sample generation, which is memory-efficient and effective for learning tracking representations. We apply the representations learned by SoCL to visual tracking and show that our method can 1) outperform the fully supervised baseline trained with box annotations under the same annotation time cost; 2) match the performance of the fully supervised baseline using the same number of training frames while reducing annotation time by 78% and total fees by 85%; and 3) remain robust to annotation noise.