Diffusion models have recently shown remarkable success in high-quality image generation. However, a pre-trained diffusion model sometimes exhibits partial misalignment, in the sense that the model can generate good images but occasionally outputs undesirable ones. In that case, we simply need to prevent the generation of the bad images, and we call this task censoring. In this work, we present censored generation with a pre-trained diffusion model using a reward model trained on minimal human feedback. We show that censoring can be accomplished with extreme human feedback efficiency and that labels generated with just a few minutes of human feedback are sufficient. Code available at: https://github.com/tetrzim/diffusion-human-feedback.