Motivated by the new records that DETR-based approaches have set on COCO detection and segmentation benchmarks, many recent efforts seek to improve these approaches further by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies have already claimed significant accuracy improvements. In this paper, we take a closer look at their experimental methodology and check whether their approaches remain effective on very recent state-of-the-art models such as $\mathcal{H}$-Deformable-DETR. We conduct thorough experiments on COCO object detection to study the influence of the choice of pre-training dataset and of the localization and classification target generation schemes. Unfortunately, we find that previous representative self-supervised approaches, such as DETReg, fail to boost the performance of strong DETR-based approaches in the full-data regime. We further analyze the reasons and find that simply combining a more accurate box predictor with the Objects$365$ benchmark significantly improves the results in follow-up experiments. We demonstrate the effectiveness of our approach by achieving a strong object detection result of AP=$59.3\%$ on the COCO val set, which surpasses $\mathcal{H}$-Deformable-DETR + Swin-L by +$1.4\%$. Last, we generate a series of synthetic pre-training datasets by combining very recent image-to-text captioning models (LLaVA) and text-to-image generative models (SDXL). Pre-training on these synthetic datasets leads to notable improvements in object detection performance. Looking ahead, we anticipate substantial advantages from future expansion of the synthetic pre-training dataset.