Current semantic segmentation models typically require large amounts of manually annotated data, a process that is both time-consuming and resource-intensive. Leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient alternative, enabling the automatic generation of synthetic data in place of manual annotation. However, previous methods have been limited to single-instance images, because generating multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose \textbf{Free-Mask}, a framework that combines a diffusion model for segmentation with advanced image editing capabilities, allowing multiple objects to be integrated into images via text-to-image models. Our method produces highly realistic datasets that closely emulate open-world environments together with accurate segmentation masks, reducing the labor of manual annotation while ensuring precise mask generation. Experimental results demonstrate that synthetic data generated by \textbf{Free-Mask} enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, \textbf{Free-Mask} achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
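For intuition, the synthesize-then-composite idea can be sketched in a few lines of Python. The snippet below is a minimal illustration under stated assumptions, not the \textbf{Free-Mask} implementation: it assumes the Hugging Face \texttt{diffusers} API with the \texttt{runwayml/stable-diffusion-v1-5} checkpoint, and it substitutes a crude brightness threshold for the paper's diffusion-based mask prediction; the placement coordinates are likewise hypothetical.
\begin{verbatim}
# Minimal sketch of multi-instance synthetic data generation.
# NOT the Free-Mask implementation: the threshold mask below is a
# stand-in for the paper's segmentation diffusion model.
import torch
import numpy as np
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1. Generate a background scene and a single-object foreground.
background = pipe("a photo of an empty street").images[0]
foreground = pipe("a photo of a dog on a plain white background").images[0]

# 2. Placeholder mask: treat near-white pixels as background.
#    Free-Mask instead predicts the mask with a diffusion model.
fg = np.asarray(foreground)
mask = (fg.mean(axis=-1) < 240).astype(np.uint8) * 255
mask_img = Image.fromarray(mask, mode="L")

# 3. Composite the object into the scene; the pasted mask doubles
#    as the automatically obtained segmentation label.
x, y = 100, 200  # hypothetical placement
label = Image.new("L", background.size, 0)
background.paste(foreground, (x, y), mask_img)
label.paste(mask_img, (x, y), mask_img)

background.save("synthetic_image.png")
label.save("synthetic_mask.png")
\end{verbatim}
Repeating step 3 with additional foregrounds yields the multi-object images that single-pass Stable Diffusion generation struggles to produce reliably, which is the gap the framework targets.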