Text-to-image (T2I) models achieve high-fidelity generation through extensive training on large datasets. However, these models may unintentionally pick up undesirable biases from their training data, such as over-representation of particular identities under gender- or ethnicity-neutral prompts. Existing alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) fail to address this problem effectively because they operate on pairwise preferences between individual samples, while the aforementioned biases can only be measured at a population level. For example, a single sample for the prompt "doctor" could be male or female, but a model generating predominantly male doctors even with repeated sampling reflects a gender bias. To address this limitation, we introduce PopAlign, a novel approach for population-level preference optimization: rather than ranking individual samples, it expresses preferences over entire sets of samples. We further derive a stochastic lower bound that reduces this set-level objective to optimizing individual samples drawn from preferred populations over those from dispreferred ones, enabling scalable training. Using human evaluation and standard image quality and bias metrics, we show that PopAlign significantly mitigates the bias of pretrained T2I models while largely preserving generation quality. Code is available at https://github.com/jacklishufan/PopAlignSDXL.
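The contrast between pairwise and population-level preference can be made concrete with a minimal, illustrative sketch. The function names, the plain-Python log-prob inputs, and the reduction below are assumptions for exposition only, not the paper's actual objective: a standard DPO-style term compares one preferred sample against one dispreferred sample, while a population-level lower bound in the spirit described above averages such per-sample terms over samples drawn from a preferred population (e.g., demographically balanced generations) and a dispreferred one.

```python
import math


def dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard pairwise DPO term: one preferred sample (w) vs. one
    # dispreferred sample (l), with log-probs from the model being
    # trained and a frozen reference model. Returns -log sigmoid(margin).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def population_lower_bound_loss(logps_w, logps_l,
                                ref_logps_w, ref_logps_l, beta=0.1):
    # Illustrative population-level variant: average per-sample DPO terms
    # over samples from a preferred population and a dispreferred one.
    # This mirrors the idea of a stochastic lower bound that reduces a
    # set-level preference to tractable per-sample terms; the exact
    # PopAlign objective differs.
    terms = [
        dpo_style_loss(lw, ll, rw, rl, beta)
        for lw, ll, rw, rl in zip(logps_w, logps_l,
                                  ref_logps_w, ref_logps_l)
    ]
    return sum(terms) / len(terms)
```

The key property this sketch illustrates: each gradient step still touches only individual samples, yet the labels (preferred vs. dispreferred) are assigned to whole populations, which is what lets a set-level bias criterion be trained at per-sample cost.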