DETR-based object detectors achieve remarkable performance but are sample-inefficient and converge slowly. Unsupervised pretraining helps alleviate both problems by allowing large amounts of unlabeled data to be used to improve the detector's performance. However, existing methods have limitations, such as keeping the detector's backbone frozen to avoid performance degradation and using pretraining objectives that are misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors built on three key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, and (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) our pretraining outperforms prior DETR pretraining works in both the full- and low-data regimes by significant margins; (2) DETR can be pretrained from scratch (including the backbone) directly on complex image datasets such as COCO, paving the way for unsupervised representation learning directly with DETR.
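Ingredient (ii) can be illustrated with a minimal sketch, assuming each object proposal has already been mapped to a feature vector. The abstract does not say which clustering algorithm is used; plain k-means with a deterministic farthest-point initialization stands in for it here. The names `sq_dist` and `kmeans_pseudo_labels` and the toy features are hypothetical and not taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(features, k, iters=20):
    """Cluster proposal features and return one cluster id per proposal.

    The cluster ids play the role of object pseudo-labels (an assumption
    made for illustration; the paper's actual procedure may differ).
    """
    # Farthest-point initialization: start from the first feature, then
    # repeatedly add the feature farthest from all chosen centroids.
    centroids = [list(features[0])]
    while len(centroids) < k:
        far = max(features, key=lambda f: min(sq_dist(f, c) for c in centroids))
        centroids.append(list(far))

    labels = []
    for _ in range(iters):
        # Assignment step: each feature takes the id of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy proposal features forming two well-separated groups.
toy_features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
toy_labels = kmeans_pseudo_labels(toy_features, k=2)
print(toy_labels)  # [0, 0, 0, 1, 1, 1]
```

In a pretraining setup of this kind, such cluster ids would serve as classification targets for the detector in place of human-annotated classes.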