We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and run a single round of the diffusion process to generate multiple instances simultaneously, conditioning each region on a different text prompt. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion.
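The two designs described above can be illustrated with a minimal NumPy sketch. It is not the released MosaicFusion code: the function names (`split_canvas`, `aggregate_attention`, `threshold_mask`), the uniform grid layout, the plain averaging, the min-max normalization, and the 0.5 threshold are all assumptions made for illustration. The sketch also assumes the cross-attention maps have already been extracted from the diffusion model and resized to a common resolution, and it omits the edge-aware refinement step.

```python
import numpy as np


def split_canvas(height, width, rows, cols):
    """Tile a canvas into rows x cols boxes given as (top, left, bottom, right).

    Assumption: a uniform grid; the actual region layout may differ.
    """
    ys = np.linspace(0, height, rows + 1).astype(int)
    xs = np.linspace(0, width, cols + 1).astype(int)
    return [
        (int(ys[r]), int(xs[c]), int(ys[r + 1]), int(xs[c + 1]))
        for r in range(rows)
        for c in range(cols)
    ]


def aggregate_attention(maps):
    """Average maps of shape (steps, layers, H, W), then rescale to [0, 1].

    Assumption: all maps were resized to the same H x W beforehand.
    """
    mean_map = maps.mean(axis=(0, 1))
    lo, hi = mean_map.min(), mean_map.max()
    if hi == lo:
        # A flat map carries no localization signal.
        return np.zeros_like(mean_map)
    return (mean_map - lo) / (hi - lo)


def threshold_mask(attn, threshold=0.5):
    """Binarize a normalized attention map (threshold value is hypothetical)."""
    return attn >= threshold


# Toy example: a 4x6 canvas split into a 2x2 grid of regions.
boxes = split_canvas(4, 6, rows=2, cols=2)

# Toy attention maps for one object prompt: 2 time steps, 3 layers, 4x4 pixels,
# with a high-attention 2x2 patch in the center.
toy_maps = np.zeros((2, 3, 4, 4))
toy_maps[:, :, 1:3, 1:3] = 1.0
mask = threshold_mask(aggregate_attention(toy_maps))
```

In this toy setup each box would receive its own text prompt, and `mask` marks the pixels where the prompt's averaged attention is high. A real pipeline would follow this with the refinement step that the abstract mentions.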