In this paper, we introduce a novel paradigm for enhancing the capabilities of an object detector, e.g., expanding its categories or improving its detection performance, by training it on a synthetic dataset generated by a diffusion model. Specifically, we integrate an instance-level grounding head into a pre-trained generative diffusion model, giving it the ability to localise arbitrary instances in the generated images. The grounding head is trained to align the text embeddings of category names with the regional visual features of the diffusion model, using supervision from an off-the-shelf object detector together with a novel self-training scheme for (novel) categories not covered by the detector. This enhanced diffusion model, termed InstaGen, can serve as a data synthesizer for object detection. We conduct thorough experiments showing that an object detector can be enhanced by training on the synthetic dataset from InstaGen, which demonstrates superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios.
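As a rough illustration of what aligning category-name text embeddings with regional visual features can look like, the sketch below scores each region feature against each category embedding by cosine similarity and keeps the best match when it clears a threshold. This is not InstaGen's grounding head: the function names, the toy 2-D vectors, and the `threshold` value are all assumptions made for illustration, and a trained head would learn the alignment rather than apply a fixed rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def label_regions(region_feats, text_embeds, threshold=0.5):
    """Assign each region the best-matching category name, or None.

    region_feats: list of feature vectors, one per candidate region (hypothetical).
    text_embeds:  dict mapping category name -> text embedding (hypothetical).
    threshold:    assumed minimum similarity for accepting a match.
    """
    labels = []
    for feat in region_feats:
        scores = {name: cosine(feat, emb) for name, emb in text_embeds.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels

# Toy example: two categories and three regions in a 2-D embedding space.
text_embeds = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
region_feats = [[0.9, 0.1], [0.2, 0.8], [-1.0, -1.0]]
labels = label_regions(region_feats, text_embeds)
print(labels)
```

In this toy setup the first two regions match "cat" and "dog", while the third falls below the threshold and is left unlabelled. Because categories are represented only by their text embeddings, adding a new category to `text_embeds` requires no change to the scoring code, which is the property that makes text-aligned grounding attractive for open-vocabulary settings.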