Text-to-image diffusion models produce high-quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion, which adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations, such as single points, scribbles, bounding boxes, or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control: our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform the previous state of the art by 20.4% AP$_{50}^\text{box}$ for box inputs and 25.4% IoU for mask inputs.
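The two reported metrics both rest on intersection-over-union (IoU): under the standard COCO convention, AP$_{50}$ counts a predicted box as correct when its IoU with a ground-truth box is at least 0.5, and mask IoU compares pixel sets directly. The following is a minimal sketch of those two overlap measures only. It is not the evaluation pipeline behind the numbers above, and the function names `box_iou` and `mask_iou`, along with the `(x_min, y_min, x_max, y_max)` box layout, are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union of two same-sized binary masks (rows of 0/1)."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes offset by one unit overlap in a 1x1 square.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Masks sharing one foreground pixel out of three in the union.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

In this sketch, a generated instance would count toward AP$_{50}^\text{box}$ only if `box_iou` between its detected box and the box given as the location condition is at least 0.5; how the detected box is obtained is outside the scope of the example.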