The default post-training paradigm for text-to-image generators relies on post-hoc selection of generated images, followed by training against a single reward model to align the generator with that reward, typically user preference. This approach discards informative data and optimizes for only one reward, which harms diversity, semantic fidelity and efficiency. Instead, we propose MIRO, a method that conditions the model on multiple rewards during training, letting the model learn user preferences directly. MIRO pre-training both improves the visual quality of the generated images and speeds up training, achieving state-of-the-art results on the GenEval compositional benchmark and on user-preference scores (PickAScore, ImageReward, HPSv2).
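To make the contrast between post-hoc selection and multi-reward conditioning concrete, the minimal Python sketch below sets the two side by side under stated assumptions. It is not the MIRO implementation. The reward names, score ranges, uniform binning into num_bins buckets, and the helper names (quantize, reward_condition, posthoc_filter, conditioned_dataset, inference_condition) are all hypothetical, and using the listed user-preference scores as training-time conditions is itself an assumption. Each training image keeps its caption and is tagged with a vector of discretized reward scores that the generator would receive as extra conditioning; at sampling time one requests the highest bin of every reward.

```python
from dataclasses import dataclass

# Hypothetical conditioning signals. The abstract names these scores only as
# evaluation metrics; using them as training-time conditions is an assumption.
REWARD_NAMES = ("pickascore", "imagereward", "hpsv2")


@dataclass
class Sample:
    caption: str
    rewards: dict  # reward name -> raw score from a (hypothetical) reward model


def quantize(score, lo, hi, num_bins):
    """Map a raw score to a bin in [0, num_bins - 1] using uniform bins (assumed)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = min(max((score - lo) / (hi - lo), 0.0), 1.0)  # normalize and clip to [0, 1]
    return min(int(t * num_bins), num_bins - 1)


def reward_condition(sample, ranges, num_bins):
    """One bin per reward, in a fixed order, to be fed to the generator as conditioning."""
    return [quantize(sample.rewards[name], *ranges[name], num_bins) for name in REWARD_NAMES]


def posthoc_filter(samples, reward, threshold):
    """Baseline: keep only samples whose single reward clears a threshold."""
    return [s for s in samples if s.rewards[reward] >= threshold]


def conditioned_dataset(samples, ranges, num_bins):
    """Multi-reward conditioning: keep every sample and attach its reward bins."""
    return [(s.caption, reward_condition(s, ranges, num_bins)) for s in samples]


def inference_condition(num_bins):
    """At sampling time, request the top bin for every reward."""
    return [num_bins - 1] * len(REWARD_NAMES)


# Usage with made-up scores and score ranges.
ranges = {"pickascore": (18.0, 24.0), "imagereward": (-2.0, 2.0), "hpsv2": (0.20, 0.32)}
samples = [
    Sample("a red cube on a blue sphere", {"pickascore": 23.0, "imagereward": 1.5, "hpsv2": 0.30}),
    Sample("a cat reading a newspaper", {"pickascore": 19.0, "imagereward": -0.5, "hpsv2": 0.25}),
]
print(len(posthoc_filter(samples, "pickascore", 21.0)))  # 1: the second image is discarded
print(conditioned_dataset(samples, ranges, num_bins=4))
print(inference_condition(num_bins=4))  # [3, 3, 3]
```

In this toy setup the single-reward filter keeps only one of the two images, while the conditioned dataset keeps both and simply labels the weaker image with lower reward bins, so it still contributes a training signal. This mirrors the abstract's point that post-hoc selection discards informative data.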