Reinforcement learning from human feedback (RLHF) with reward models has advanced the alignment of generative models with human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax: improving one dimension degrades others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference time for flexible preference control. Experiments on Text-to-Image generation show improvements on GenEval, PickScore, and OCR of 36.1%, 4.6%, and 55.7% (Stable Diffusion 3.5 Medium) and 32.7%, 4.3%, and 67.1% (FLUX.1-dev), respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the Helpful Assistant language task with Llama-2 7B, helpfulness and harmlessness improve by 43.4% and 136.7%, respectively. Our framework establishes a new state-of-the-art recipe for multi-preference alignment across modalities.
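To make the MapReduce LoRA description above concrete, the following is a minimal toy sketch in PyTorch of the map/reduce loop: several LoRA experts are trained in parallel, each against its own reward, their low-rank updates are folded in and averaged back into a shared base, and the cycle repeats from the refreshed base. All names (`LoRALinear`, `train_expert`, `reduce_experts`, the toy rewards, ranks, and step counts) are illustrative assumptions, not the paper's actual implementation, and weight averaging stands in for whatever merge rule the method actually uses.

```python
# Toy sketch of the MapReduce LoRA loop; hypothetical names, not the paper's API.
import copy
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T

    def merged_weight(self):
        # Weight after folding the low-rank update into the frozen base.
        return self.base.weight + self.B @ self.A


def train_expert(base: nn.Linear, reward_fn, steps: int = 100) -> LoRALinear:
    """Map step: fit one LoRA expert against a single (toy) reward."""
    expert = LoRALinear(copy.deepcopy(base))
    opt = torch.optim.Adam([expert.A, expert.B], lr=1e-2)
    for _ in range(steps):
        x = torch.randn(32, base.in_features)
        loss = -reward_fn(expert(x)).mean()  # maximize the expert's reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return expert


def reduce_experts(base: nn.Linear, experts) -> nn.Linear:
    """Reduce step: average merged expert weights back into a shared base."""
    merged = copy.deepcopy(base)
    with torch.no_grad():
        merged.weight.copy_(
            torch.stack([e.merged_weight() for e in experts]).mean(0)
        )
    return merged


# Iterate: each round re-trains the experts from the refreshed shared base.
base = nn.Linear(16, 16)
rewards = [lambda y: -(y - 1.0).pow(2),  # two toy "preferences" pulling
           lambda y: -(y + 1.0).pow(2)]  # the outputs in opposite directions
for _ in range(3):
    experts = [train_expert(base, r) for r in rewards]
    base = reduce_experts(base, experts)
```

The two toy rewards deliberately conflict, which is the regime where a naive joint objective pays the alignment tax; the sketch only illustrates the parallel-train-then-merge structure, not the reward models, RaTE embeddings, or schedules used in the actual experiments.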