DEtection TRansformer (DETR) for object detection achieves performance competitive with Faster R-CNN via a transformer encoder-decoder architecture. However, because its transformers are trained from scratch, DETR needs large-scale training data and an extremely long training schedule, even on the COCO dataset. Inspired by the great success of pre-training transformers in natural language processing, we propose a novel pretext task, random query patch detection, for Unsupervised Pre-training DETR (UP-DETR). Specifically, we randomly crop patches from the given image and feed them as queries to the decoder. The model is pre-trained to detect these query patches in the input image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off classification and localization preferences in the pretext task, we find that freezing the CNN backbone is a prerequisite for successfully pre-training the transformers. (2) To perform multi-query localization, we extend UP-DETR to multi-query patch detection with an attention mask. In addition, UP-DETR provides a unified perspective for fine-tuning on object detection and one-shot detection tasks. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation. Code and pre-trained models: https://github.com/dddzg/up-detr.
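The pretext task described above can be illustrated with a minimal sketch. It is not the released implementation: the function names `random_query_patches` and `group_attention_mask`, the patch-size range (`min_frac`, `max_frac`), the normalized (cx, cy, w, h) box format, and the choice to split the object queries into equal groups with attention blocked across groups are all assumptions made for illustration.

```python
import numpy as np


def random_query_patches(image, num_patches, rng, min_frac=0.1, max_frac=0.5):
    """Crop random patches and return them with their target boxes.

    Boxes are normalized (cx, cy, w, h); this format and the size range
    are assumptions of this sketch, not values taken from the abstract.
    """
    height, width = image.shape[:2]
    patches, boxes = [], []
    for _ in range(num_patches):
        # Sample the patch size as a fraction of the image size.
        ph = max(1, int(rng.uniform(min_frac, max_frac) * height))
        pw = max(1, int(rng.uniform(min_frac, max_frac) * width))
        # Sample the top-left corner so the patch stays inside the image.
        top = int(rng.integers(0, height - ph + 1))
        left = int(rng.integers(0, width - pw + 1))
        patches.append(image[top:top + ph, left:left + pw])
        boxes.append([
            (left + pw / 2) / width,   # cx
            (top + ph / 2) / height,   # cy
            pw / width,                # w
            ph / height,               # h
        ])
    return patches, np.asarray(boxes, dtype=np.float32)


def group_attention_mask(num_queries, num_patches):
    """Additive self-attention mask for grouped object queries.

    Assumed scheme: queries are split into equal groups, one group per
    query patch; entries are 0 within a group and -inf across groups.
    """
    if num_queries % num_patches != 0:
        raise ValueError("num_queries must be divisible by num_patches")
    group_ids = np.arange(num_queries) // (num_queries // num_patches)
    same_group = group_ids[:, None] == group_ids[None, :]
    return np.where(same_group, 0.0, -np.inf).astype(np.float32)


# Usage: four query patches from a random stand-in image, eight object queries.
rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))
patches, boxes = random_query_patches(image, num_patches=4, rng=rng)
mask = group_attention_mask(num_queries=8, num_patches=4)
```

In this sketch the cropped patches would be encoded and supplied to the decoder as queries, the boxes would serve as the localization targets, and the mask would be added to the decoder self-attention logits so that queries assigned to different patches do not attend to each other.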