This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional discriminative pipelines, GenDet leverages generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet builds a conditional generation architecture on the large-scale pre-trained Stable Diffusion model, formulating detection as semantic constraints in the latent space. This design enables precise control over bounding box positions and category attributes while preserving the flexibility of the generative model. The methodology bridges the gap between generative models and discriminative tasks, offering a fresh perspective on building unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves detection accuracy competitive with discriminative detectors while retaining the flexibility characteristic of generative methods.
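To make the "detection as conditional generation" idea concrete, the sketch below shows a minimal diffusion-style sampling loop that denoises a set of latent vectors, conditioned on image features, into box coordinates and class logits. This is an illustrative assumption, not the paper's actual architecture: all names (BoxDenoiser, sample_detections, num_queries) are hypothetical, and timestep embeddings, cross-attention, and the Stable Diffusion backbone described in the abstract are omitted for brevity.

```python
# Hypothetical sketch of detection as conditional denoising (assumed,
# simplified stand-in for the Stable-Diffusion-based design in the abstract).
import torch
import torch.nn as nn

class BoxDenoiser(nn.Module):
    """Predicts the noise added to noisy detection latents, conditioned on
    a pooled image feature (timestep embedding omitted for brevity)."""
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.latent_dim = 4 + num_classes          # (cx, cy, w, h) + class logits
        self.proj_in = nn.Linear(self.latent_dim, dim)
        self.cond_proj = nn.Linear(dim, dim)
        self.blocks = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.ReLU()
        )
        self.proj_out = nn.Linear(dim, self.latent_dim)

    def forward(self, z_t, t, img_feat):
        # z_t: (B, Q, latent_dim) noisy detection latents at step t
        # img_feat: (B, dim) image conditioning feature
        h = self.proj_in(z_t) + self.cond_proj(img_feat).unsqueeze(1)
        return self.proj_out(self.blocks(h))        # predicted noise

@torch.no_grad()
def sample_detections(model, img_feat, num_queries=100, steps=50):
    """DDPM-style ancestral sampling: start from Gaussian noise and
    iteratively denoise toward boxes + class logits."""
    B = img_feat.shape[0]
    z = torch.randn(B, num_queries, model.latent_dim)
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = model(z, t, img_feat)
        # posterior mean of the reverse step; noise term dropped at t = 0
        z = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    boxes = z[..., :4].sigmoid()                     # normalized box coordinates
    logits = z[..., 4:]                              # per-query class scores
    return boxes, logits
```

A usage example would pass a pooled backbone feature as img_feat and threshold the returned logits to obtain final detections; the key design point the sketch captures is that boxes are sampled by a generative reverse process rather than regressed in a single forward pass.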