We present TokenCompose, a Latent Diffusion Model for text-to-image generation that improves consistency between user-specified text prompts and model-generated images. Despite the tremendous success of Latent Diffusion Models, their standard denoising process uses the text prompt only as a condition, with no explicit constraint on consistency between the prompt and the image content, which leads to unsatisfactory results when composing multiple object categories. TokenCompose improves multi-category instance composition by introducing token-wise consistency terms between the image content and object segmentation maps during finetuning. It can be applied directly to the existing training pipeline of text-conditioned diffusion models and requires no extra human labeling. When used to finetune Stable Diffusion, the resulting model shows significant improvements in multi-category instance composition and enhanced photorealism in its generated images.
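To make the idea of a token-wise consistency term concrete, the following is a minimal sketch of one plausible form. The abstract does not specify the exact loss, so everything here is an assumption for illustration: the function name `token_consistency_loss`, the use of a per-token spatial activation map (for example, a cross-attention map), and the squared "fraction outside the mask" penalty are all hypothetical.

```python
import numpy as np


def token_consistency_loss(act_maps, seg_masks, eps=1e-8):
    """Hypothetical token-wise consistency term (illustrative only).

    act_maps:  (T, H, W) nonnegative spatial activations, one map per
               object token (assumed here to come from the model).
    seg_masks: (T, H, W) binary segmentation masks, one per object token.

    For each token, measure the fraction of activation mass that falls
    inside its segmentation mask and penalize the remainder, squared.
    """
    act_maps = np.asarray(act_maps, dtype=np.float64)
    seg_masks = np.asarray(seg_masks, dtype=np.float64)
    if act_maps.shape != seg_masks.shape:
        raise ValueError("act_maps and seg_masks must have the same shape")

    # Activation mass inside the mask versus total mass, per token.
    inside = (act_maps * seg_masks).sum(axis=(1, 2))
    total = act_maps.sum(axis=(1, 2))
    frac_inside = inside / (total + eps)

    # Zero when all mass lies inside the mask, one when none does.
    return float(np.mean((1.0 - frac_inside) ** 2))
```

Under this assumed form, a token whose activation is concentrated entirely within its object's segmentation region contributes a loss near zero, while a token whose activation lies entirely outside contributes a loss near one. In a finetuning setup, such a term would be added to the usual denoising objective with a weighting factor, which is likewise an assumption rather than a detail stated in the abstract.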