The scale and quality of datasets are crucial for training robust perception models, yet obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation, synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to correct misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.
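To make the first mechanism concrete, the sketch below illustrates one plausible form of region-guided rectification inside a reverse-diffusion loop: at each step, the predicted clean image is scored by a frozen detector, and boxes whose intended object is missing or misplaced are re-noised back to the current noise level so they are regenerated, while well-formed regions are preserved. All names and interfaces here (`denoiser`, `detector`, `scheduler`, `score_thresh`) are illustrative assumptions, not the paper's actual API.

```python
import torch

def sample_with_region_rectification(
    denoiser,          # hypothetical: eps-prediction network, denoiser(x_t, t, cond) -> eps
    detector,          # hypothetical: frozen perception model, detector(x0_pred, boxes) -> per-box scores
    scheduler,         # hypothetical: exposes num_steps, alphas_cumprod[t], and step(x_t, eps, t) -> x_{t-1}
    cond,              # text / layout conditioning
    boxes,             # (N, 4) target boxes in pixel coordinates
    x_T,               # initial Gaussian noise, shape (1, C, H, W)
    score_thresh=0.5,  # assumed confidence threshold for flagging a region as misgenerated
):
    """Sketch of region-guided rectification within diffusion sampling (illustrative)."""
    x_t = x_T
    for t in reversed(range(scheduler.num_steps)):
        eps = denoiser(x_t, t, cond)
        a_t = scheduler.alphas_cumprod[t]
        # Predicted clean image from the current noisy sample (standard DDPM identity).
        x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        scores = detector(x0_pred, boxes)   # confidence that each box contains its intended object
        bad = scores < score_thresh         # regions flagged as misgenerated

        if bad.any():
            # Binary mask covering the flagged boxes.
            mask = torch.zeros_like(x0_pred[:, :1])
            for x1, y1, x2, y2 in boxes[bad].long().tolist():
                mask[..., y1:y2, x1:x2] = 1.0
            # Re-noise only the flagged regions back to noise level t; keep the rest intact.
            noise = torch.randn_like(x_t)
            renoised = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * noise
            x_t = mask * renoised + (1 - mask) * x_t

        x_t = scheduler.step(x_t, eps, t)   # standard reverse-diffusion update
    return x_t
```

In this reading, rectification amounts to an inpainting-style resampling of only the regions the perception model rejects, so feedback is applied during generation rather than as post-processing.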
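The second mechanism, region-aligned cross-attention, can be sketched as a masked attention in which each spatial location may attend to a region's text tokens only if it lies inside that region, which is one way to block the semantic leakage the abstract describes. The tensor layout and the `region_masks` / `token_groups` inputs below are assumptions for illustration, not the paper's implementation.

```python
import math
import torch

def region_aligned_cross_attention(q, k, v, region_masks, token_groups):
    """Sketch of region-aligned cross-attention (illustrative assumptions).

    q:            (B, HW, d)  image-query features, flattened spatially
    k, v:         (B, T, d)   text-token keys / values
    region_masks: (B, N, HW)  binary spatial mask per region (1 inside the box)
    token_groups: (B, N, T)   binary assignment of text tokens to regions
    """
    d = q.shape[-1]
    raw = q @ k.transpose(-1, -2) / math.sqrt(d)              # (B, HW, T) attention logits

    # allowed[b, p, t] = 1 iff pixel p lies inside some region whose tokens include t.
    allowed = torch.einsum("bnp,bnt->bpt", region_masks, token_groups).clamp(max=1.0)
    logits = raw.masked_fill(allowed == 0, float("-inf"))

    # Pixels outside every region (e.g. background) would have all tokens masked;
    # let those rows fall back to ordinary unmasked attention.
    fully_masked = allowed.sum(-1, keepdim=True) == 0
    logits = torch.where(fully_masked, raw, logits)

    attn = logits.softmax(dim=-1)                             # (B, HW, T)
    return attn @ v                                           # (B, HW, d)
```

Restricting each phrase's tokens to its own box in this way ties textual cues to their spatial regions, which matches the abstract's stated goal of spatial-semantic alignment.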