Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street where she grew up. To generate that exact image, she guides a generative model through multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close the remaining gap using language prompts alone. This paper aims to narrow this gap, building on the observation that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$ is. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization), which generates $K$ new images as a function of $x^{p*}$, collects preferential feedback from the user, uses that feedback to guide the diffusion model, and then generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
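The round-based structure of the loop above can be sketched as a toy (this is an illustration under stated assumptions, not the paper's implementation: `generate` and `prefer` are hypothetical stand-ins for the diffusion-guided candidate generator and the user's multi-choice pick, and a real number plays the role of an image):

```python
import random

def multi_choice_feedback_loop(generate, prefer, x_init, K=4, B=10):
    """Illustrative MultiBO-style loop: each round, propose K new
    candidates around the current best, let the user pick the one
    closest to their mental target, and continue from that choice."""
    x_best = x_init
    for _ in range(B):
        candidates = [generate(x_best) for _ in range(K)]
        # The current best is kept in the choice set, so the user's
        # selection can never move away from the target.
        x_best = prefer(candidates + [x_best])
    return x_best

# Toy demo: the "target image" x* is the number 0.0, which the
# generator never sees; only the simulated user's preferences do.
random.seed(0)
target = 0.0
generate = lambda x: x + random.uniform(-0.5, 0.5)          # perturb current best
prefer = lambda cs: min(cs, key=lambda c: abs(c - target))  # simulated user choice
x0 = 5.0
xB = multi_choice_feedback_loop(generate, prefer, x0, K=4, B=20)
```

Because the current best is always included in the choice set, the distance to the target is non-increasing across rounds, mirroring the claim that $B$ rounds of preferential feedback move the result closer to $x^\ast$ without the model ever observing $x^\ast$.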