We present DiffChat, a novel method to align Large Language Models (LLMs) to "chat" with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable Diffusion) for interactive image creation. Given a raw prompt/image and a user-specified instruction, DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create a high-quality target image. To achieve this, we first collect an instruction-following prompt engineering dataset named InstructPE for the supervised training of DiffChat. Next, we propose a reinforcement learning framework with feedback from three core criteria for image creation, i.e., aesthetics, user preference, and content integrity. It involves an action-space dynamic modification technique to obtain more relevant positive samples and harder negative samples during off-policy sampling. Content integrity is also introduced into the value estimation function to further improve the produced images. Our method outperforms baseline models and strong competitors in both automatic and human evaluations, which fully demonstrates its effectiveness.
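As a rough illustration of the reward signal described above, the three feedback criteria could be combined into a single scalar for RL fine-tuning. This is a minimal sketch under assumed conventions: the function name, signature, and weighting scheme are illustrative and are not taken from the paper.

```python
# Hypothetical sketch: combine the three image-creation criteria
# (aesthetics, user preference, content integrity) into one scalar
# reward. Weights are illustrative, not the paper's values.

def combined_reward(aesthetics: float,
                    preference: float,
                    integrity: float,
                    weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three feedback criteria for one sample."""
    w_a, w_p, w_i = weights
    return w_a * aesthetics + w_p * preference + w_i * integrity
```

In an actual training loop, each score would come from a learned or off-the-shelf scorer (e.g., an aesthetics predictor and a preference model), and the weights would be tuned on validation data.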