Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high-performing models that generate aesthetically appealing, photorealistic images. Despite this progress, these models still struggle to produce images that are consistent with the input prompt, often failing to correctly capture object quantities, relations, and attributes. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they often require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
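To make the optimization-by-prompting loop concrete, the sketch below shows one plausible reading of the iterative procedure described above: an LLM proposes revised prompts conditioned on previously scored revisions, each candidate is rendered by the T2I model and scored for prompt-image consistency (e.g., a DSG-style score), and the best-scoring prompt is kept. This is a minimal illustration, not the paper's implementation; the function names and parameters (`revise_prompts`, `generate_image`, `score_consistency`, `num_iters`, `num_candidates`) are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

def optimize_prompt(
    user_prompt: str,
    revise_prompts: Callable[[str, List[Tuple[str, float]], int], List[str]],
    generate_image: Callable[[str], object],
    score_consistency: Callable[[str, object], float],
    num_iters: int = 10,
    num_candidates: int = 5,
) -> Tuple[str, float]:
    """Iteratively ask an LLM for revised prompts and keep the best-scoring one.

    NOTE: the callables are assumptions standing in for an LLM prompt reviser,
    a T2I generator, and a consistency scorer; they are not defined by the paper.
    """
    history: List[Tuple[str, float]] = []  # (revised prompt, consistency score) pairs
    best_prompt, best_score = user_prompt, float("-inf")
    for _ in range(num_iters):
        # The LLM sees the original user prompt plus the scored history of
        # earlier revisions and proposes new candidate prompts.
        candidates = revise_prompts(user_prompt, history, num_candidates)
        for prompt in candidates:
            image = generate_image(prompt)                 # T2I model call
            score = score_consistency(user_prompt, image)  # prompt-image consistency
            history.append((prompt, score))
            if score > best_score:
                best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

In this reading, the consistency score acts as the optimization objective and the LLM acts as the proposal mechanism, so no gradient access to or fine-tuning of the T2I model is required.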