Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques that often require hand-drawn masks. Nonetheless, existing works struggle to take advantage of the natural instance-level compositionality of scenes because rasterized RGB output images are typically flat. To address this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training-free pipeline that decomposes a monocular RGB image into a stack of RGBA layers comprising a background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create the MuLAn-COCO and MuLAn-LAION datasets, which contain image decompositions that vary in style, composition and complexity. MuLAn is the first photorealistic resource to provide instance decomposition and occlusion information for high-quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/.
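To make the layered representation concrete, the following is a minimal sketch of how a stack of RGBA layers can be flattened back into a single RGB image with the standard alpha "over" operator. It is an illustration only, not the re-assembly module of the MuLAn pipeline: the function name `flatten_layers`, the float array layout of shape (H, W, 4) with values in [0, 1], and the background-first ordering are all assumptions made for this example.

```python
import numpy as np


def flatten_layers(layers):
    """Composite RGBA layers (background first) into one RGB image.

    Hypothetical helper: each layer is assumed to be a float array of
    shape (H, W, 4) with values in [0, 1].
    """
    height, width, _ = layers[0].shape
    canvas = np.zeros((height, width, 3), dtype=np.float64)
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # "Over" operator: the layer covers the canvas in proportion
        # to its alpha; alpha broadcasts across the three color channels.
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas


# Toy 2x2 example: an opaque blue background and one opaque red pixel.
background = np.zeros((2, 2, 4))
background[..., 2] = 1.0  # blue channel
background[..., 3] = 1.0  # fully opaque
instance = np.zeros((2, 2, 4))
instance[0, 0] = [1.0, 0.0, 0.0, 1.0]  # red pixel; transparent elsewhere

image = flatten_layers([background, instance])
```

Because each instance is kept on its own layer, removing or reordering an entry in the list changes the flattened result without any mask drawing, which is the property that layer-wise editing relies on.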