Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preferences remains challenging because the relevant evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores) are multiple and sometimes conflicting. Existing alignment methods typically optimize a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on the Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across the Stable Diffusion 1.5, Stable Diffusion 2.1, and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method's robustness and generalizability across diverse alignment settings.
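To illustrate why a majority-vote consensus sidesteps reward-scale conflicts, the following is a minimal sketch and not the paper's implementation. The function name `majority_vote`, the scorer keys, and all score values are hypothetical and chosen only for illustration. Each scorer casts one vote for the image it rates higher, so a metric with a large numeric range cannot dominate the outcome the way it does under a naive sum.

```python
from typing import Dict, Tuple


def majority_vote(scores_a: Dict[str, float],
                  scores_b: Dict[str, float]) -> Tuple[str, int, int]:
    """Pick the preferred image by counting per-scorer wins.

    Returns (winner, votes_a, votes_b), where winner is "a", "b", or "tie".
    Only the sign of each per-scorer comparison matters, not its magnitude.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both images must be rated by the same scorers")

    votes_a = sum(1 for k in scores_a if scores_a[k] > scores_b[k])
    votes_b = sum(1 for k in scores_a if scores_b[k] > scores_a[k])

    if votes_a > votes_b:
        winner = "a"
    elif votes_b > votes_a:
        winner = "b"
    else:
        winner = "tie"
    return winner, votes_a, votes_b


# Hypothetical scores on deliberately mismatched scales (assumed values).
image_a = {"semantic": 0.31, "aesthetic": 5.2, "preference": 0.27}
image_b = {"semantic": 0.29, "aesthetic": 6.1, "preference": 0.25}

winner, votes_a, votes_b = majority_vote(image_a, image_b)

# A naive unweighted sum is dominated by the large-range aesthetic score.
naive_winner = "a" if sum(image_a.values()) > sum(image_b.values()) else "b"

print(winner, votes_a, votes_b)  # consensus label used as the preferred sample
print(naive_winner)              # label a scale-sensitive sum would pick
```

In this toy case two of the three scorers prefer image `a`, so the consensus label is `a`, whereas the unweighted sum picks `b` because the aesthetic score has the largest numeric range. In a DPO-style training loop, the consensus winner would serve as the preferred sample of the pair. How ties are handled and how the reference model is updated are left unspecified here because the abstract does not describe them.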