Pretraining on large-scale datasets can boost the performance of object detectors, but annotated detection datasets are hard to scale up because of the high labor cost. What we possess instead are numerous isolated, field-specific datasets, so it is appealing to pretrain models jointly on an aggregation of datasets to increase data volume and diversity. In this paper, we propose a strong framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR, without the need for manual label space integration. It converts the typical multi-class classification in object detection into binary classification by introducing a pre-trained language model. Specifically, we design a category extraction module that extracts the categories potentially present in an image and assigns these categories to different queries via language embeddings. Each query is responsible only for predicting objects of its assigned class. In addition, to fit this new detection paradigm, we propose a group bipartite matching strategy that restricts each ground truth to match only queries assigned to the same category. Extensive experiments demonstrate that METR achieves strong results under both multi-task joint training and the pretrain-and-finetune paradigm. Notably, our pre-trained models transfer flexibly and improve the performance of various DETR-like detectors on the COCO val2017 benchmark. Code will be made available after this paper is published.
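The group bipartite matching idea described above can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the abstract does not specify the matching cost or the solver, so the sketch takes a precomputed cost matrix as input and uses a brute-force search over permutations in place of a Hungarian-style solver. The function names `match_group` and `group_bipartite_match` are hypothetical and are not taken from the METR code.

```python
from itertools import permutations


def match_group(cost, gt_ids, query_ids):
    """Min-cost one-to-one assignment inside a single category group.

    Brute force over permutations; an assumption made for brevity,
    standing in for a Hungarian-style solver on small inputs.
    """
    if len(gt_ids) > len(query_ids):
        raise ValueError("a group needs at least as many queries as ground truths")
    best_total, best_pairs = float("inf"), []
    for perm in permutations(query_ids, len(gt_ids)):
        total = sum(cost[g][q] for g, q in zip(gt_ids, perm))
        if total < best_total:
            best_total, best_pairs = total, list(zip(gt_ids, perm))
    return best_pairs


def group_bipartite_match(cost, gt_labels, query_labels):
    """Match each ground truth only to queries assigned to the same category.

    cost[g][q] is a hypothetical matching cost between ground truth g
    and query q; how it is computed is outside the scope of this sketch.
    """
    matches = []
    for category in sorted(set(gt_labels)):
        gt_ids = [i for i, c in enumerate(gt_labels) if c == category]
        query_ids = [j for j, c in enumerate(query_labels) if c == category]
        matches.extend(match_group(cost, gt_ids, query_ids))
    return sorted(matches)


# Toy example: two ground truths, four queries split across two categories.
gt_labels = ["cat", "dog"]
query_labels = ["cat", "cat", "dog", "dog"]
cost = [
    [0.9, 0.2, 0.0, 0.0],  # ground truth 0 (cat): cheapest overall is a dog query
    [0.1, 0.1, 0.7, 0.3],  # ground truth 1 (dog): cheapest overall is a cat query
]
pairs = group_bipartite_match(cost, gt_labels, query_labels)
print(pairs)
```

In this toy input, an unconstrained matcher would pair each ground truth with a query from the other category because those costs are lowest. The group constraint rules those pairs out, so each ground truth is matched only within its own category.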