Imagine Alice has a specific image $x^\ast$ in mind, say, the view of the street where she grew up. To generate that exact image, she guides a generative model through multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close the remaining gap using language prompts alone. This paper aims to narrow that gap by observing that, even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$ is. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization), which generates $K$ new images as a function of $x^{p*}$, collects preferential feedback from the user, uses that feedback to guide the diffusion model, and then generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared against $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
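To make the feedback loop described above concrete, the following is a minimal toy sketch, not the MultiBO method itself. It replaces the diffusion model with plain vectors, replaces Bayesian optimization with simple random perturbation around the current choice, and replaces the human with a simulated chooser that picks the option closest to a hidden target. The names `simulated_user_choice`, `multi_choice_refine`, and the `step` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np


def simulated_user_choice(options, target):
    """Stand-in for the human (assumption): pick the option closest to the hidden target."""
    distances = np.linalg.norm(options - target, axis=1)
    return int(np.argmin(distances))


def multi_choice_refine(x_start, target, K=4, B=10, step=0.5, seed=0):
    """Toy K-choice refinement over B rounds.

    The proposal step never reads `target`; only the simulated user does,
    mirroring the setting where the generator has no access to x*.
    """
    rng = np.random.default_rng(seed)
    x_best = np.asarray(x_start, dtype=float)
    history = [float(np.linalg.norm(x_best - target))]
    for _ in range(B):
        # Propose K candidates as perturbations of the current choice
        # (a placeholder for the paper's acquisition and generation step).
        proposals = x_best + step * rng.standard_normal((K, x_best.size))
        # Keep the current choice among the options so a round cannot make things worse.
        options = np.vstack([x_best[None, :], proposals])
        choice = simulated_user_choice(options, target)
        x_best = options[choice]
        history.append(float(np.linalg.norm(x_best - target)))
    return x_best, history
```

Because the current choice is always offered alongside the $K$ new candidates, the distance to the target in this toy setting is non-increasing across rounds. How quickly it shrinks depends on `K`, `B`, and `step`, and says nothing about the behavior of the actual method.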