In this paper, we propose a self-supervised RGB-T tracking method. Unlike existing deep RGB-T trackers, which are trained on large numbers of annotated RGB-T image pairs, our tracker is trained on unlabeled RGB-T video pairs in a self-supervised manner. We propose a novel cross-input consistency training strategy, built on the idea that tracking can be performed using different inputs. Specifically, we construct two distinct inputs from unlabeled RGB-T video pairs, track objects with each input, and build our cross-input consistency loss from the two sets of tracking results. We also propose a reweighting strategy that makes the loss function robust to low-quality training samples. We build our tracker on a Siamese correlation filter network. To the best of our knowledge, it is the first self-supervised RGB-T tracker. Extensive experiments on two public RGB-T tracking benchmarks demonstrate that the proposed training strategy is effective. Remarkably, despite being trained only on a corpus of unlabeled RGB-T video pairs, our tracker outperforms seven supervised RGB-T trackers on the GTOT dataset.
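The cross-input consistency idea and the reweighting step can be illustrated with a minimal sketch. This is not the loss defined in the paper: the abstract does not give the exact formulation, so the code assumes that each of the two inputs yields a flattened response map per training sample, that disagreement is measured by mean squared difference, and that reweighting simply assigns zero weight to the samples with the largest disagreement. The names `per_sample_consistency`, `reweighted_consistency_loss`, and `drop_ratio` are hypothetical.

```python
from typing import List, Sequence


def per_sample_consistency(resp_a: Sequence[float], resp_b: Sequence[float]) -> float:
    """Mean squared difference between the two tracking responses of one sample."""
    if len(resp_a) != len(resp_b) or len(resp_a) == 0:
        raise ValueError("responses must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(resp_a, resp_b)) / len(resp_a)


def reweighted_consistency_loss(
    responses_a: List[Sequence[float]],
    responses_b: List[Sequence[float]],
    drop_ratio: float = 0.25,
) -> float:
    """Weighted average of per-sample consistency losses over a batch.

    Assumption (not from the paper): samples whose two results disagree the
    most are treated as low quality and receive zero weight; the remaining
    samples share the weight equally.
    """
    if len(responses_a) != len(responses_b):
        raise ValueError("both inputs must provide one response per sample")
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")

    losses = [per_sample_consistency(a, b) for a, b in zip(responses_a, responses_b)]
    if not losses:
        return 0.0

    # Keep the samples with the smallest disagreement.
    n_keep = len(losses) - int(len(losses) * drop_ratio)
    keep = sorted(range(len(losses)), key=lambda i: losses[i])[:n_keep]

    weights = [0.0] * len(losses)
    for i in keep:
        weights[i] = 1.0 / n_keep
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    # Toy batch: the last sample disagrees strongly across the two inputs.
    from_input_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
    from_input_b = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    print(reweighted_consistency_loss(from_input_a, from_input_b, drop_ratio=0.25))
    print(reweighted_consistency_loss(from_input_a, from_input_b, drop_ratio=0.0))
```

In the toy batch, setting `drop_ratio=0.25` removes the single strongly inconsistent sample from the loss, whereas `drop_ratio=0.0` averages over all four samples. A real training loop would compute the responses with the tracker network and use a differentiable tensor library instead of plain lists.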