Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

We propose a new paradigm for automatically generating training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion). The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation. To generate foreground objects, we use a simple textual template that takes the object class name as the input prompt. The prompt is fed into a text-to-image synthesis framework, which produces varied foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we first create language descriptions of the context by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse set of context images by a text-to-image synthesis framework. Finally, we composite the context images with the foreground object masks produced in the first step, using a cut-and-paste method, to form the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We find that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 1). Moreover, combining real and synthetic data yields even better results. Further analysis indicates that the synthetic data distribution effectively complements the real data distribution. Additionally, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at https://github.com/gyhandy/Text2Image-for-Detection
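
The cut-and-paste compositing step described above can be illustrated with a minimal sketch. This is not the released implementation: the function name `paste_foreground`, its parameters, and the toy arrays are hypothetical, and the sketch assumes that the foreground fits entirely inside the background and that the mask is non-empty. It shows how pasting a masked foreground onto a context image yields the segmentation mask and bounding-box label at no extra annotation cost.

```python
import numpy as np

def paste_foreground(background, foreground, mask, top, left):
    """Paste masked foreground pixels onto a copy of the background.

    Hypothetical helper for illustration only. Returns the composite image,
    a full-size boolean object mask, and an inclusive bounding box
    (x_min, y_min, x_max, y_max) derived from that mask.
    """
    out = background.copy()
    h, w = mask.shape
    m = mask.astype(bool)

    # The slice is a view into `out`, so writing to it edits the composite.
    region = out[top:top + h, left:left + w]
    region[m] = foreground[m]

    # The pasted mask, placed in background coordinates, is the segmentation label.
    full_mask = np.zeros(background.shape[:2], dtype=bool)
    full_mask[top:top + h, left:left + w] = m

    # The bounding box label follows directly from the mask extent.
    ys, xs = np.nonzero(full_mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return out, full_mask, box

# Toy data standing in for a generated context image and a segmented foreground.
background = np.zeros((8, 8, 3), dtype=np.uint8)
foreground = np.full((3, 3, 3), 255, dtype=np.uint8)
mask = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]], dtype=np.uint8)

composite, full_mask, box = paste_foreground(background, foreground, mask, top=2, left=4)
```

In this sketch the labels come from the paste location and the foreground mask alone, which is why no manual annotation is needed for the composited image. A practical pipeline would typically also randomize scale and position and blend object boundaries; those details are omitted here.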
