Self-supervised pre-training and transformer-based networks have significantly improved object detection performance. However, most current self-supervised object detection methods are built on convolution-based architectures. We believe that the sequence characteristics of transformers should be considered when designing a transformer-based self-supervised method for object detection. To this end, we propose SeqCo-DETR, a novel Sequence Consistency-based self-supervised method for object DEtection with TRansformers. SeqCo-DETR defines a simple but effective pretext task: it minimizes the discrepancy between the output sequences that the transformer produces for different image views, and it uses bipartite matching to find the most relevant sequence pairs, which improves sequence-level self-supervised representation learning. Furthermore, we combine a mask-based augmentation strategy with the sequence consistency strategy to extract more representative contextual information about objects for the detection task. Our method achieves state-of-the-art results on MS COCO (45.8 AP) and PASCAL VOC (64.1 AP), demonstrating the effectiveness of our approach.
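The sequence consistency idea can be illustrated with a minimal sketch. This is not the SeqCo-DETR implementation: it assumes that each view yields the same number of output vectors, that discrepancy is measured by cosine distance, and that the matching can be found by brute force over permutations (practical only for a handful of sequences; a Hungarian solver would be the usual choice at scale). The names `cosine_distance`, `match_sequences`, and `sequence_consistency_loss` are hypothetical.

```python
import itertools
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed discrepancy measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def match_sequences(seq_a, seq_b):
    """Find the one-to-one pairing of seq_a and seq_b with the lowest total cost.

    Returns (perm, mean_cost), where perm[i] is the index in seq_b paired
    with seq_a[i]. Brute force over permutations is a stand-in for a proper
    bipartite matching solver and only suits very small inputs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch assumes equally long output sequences")
    n = len(seq_a)
    # Pairwise cost matrix between the outputs of the two views.
    cost = [[cosine_distance(a, b) for b in seq_b] for a in seq_a]
    best_perm, best_total = None, float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total / n


def sequence_consistency_loss(seq_a, seq_b):
    """Mean discrepancy over the matched pairs; the quantity to minimize."""
    _, mean_cost = match_sequences(seq_a, seq_b)
    return mean_cost


if __name__ == "__main__":
    # Toy outputs for two views; view_b is a rescaled, reordered copy of view_a.
    view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    view_b = [[0.0, 2.0], [2.0, 2.0], [3.0, 0.0]]
    perm, loss = match_sequences(view_a, view_b)
    print("matching:", perm)
    print("loss:", loss)
```

In this toy case the matching recovers the reordering between the two views, so the loss is close to zero. In training, the loss would be backpropagated through the network that produces the sequences, which this sketch leaves out.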