Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
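As a minimal sketch of the contrast drawn above, the snippet below compares early scalarization, where the weights are fixed once before training, with drawing a fresh preference vector per training sample and passing it to the model as conditioning. The abstract does not say how preferences are sampled or how rewards are combined, so the Dirichlet sampling, the linear weighted sum, and all function names here are illustrative assumptions rather than the authors' implementation.

```python
import random


def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards (linear scalarization)."""
    return sum(w * r for w, r in zip(weights, rewards))


def sample_preference(num_objectives, rng, alpha=1.0):
    """Draw a preference vector from a symmetric Dirichlet(alpha).

    Assumption: the source only says the weights vary continuously;
    Dirichlet sampling is one common way to cover the simplex.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def build_training_signals(per_sample_rewards, rng):
    """Pair each sample's reward vector with its own sampled preference.

    Returns (preference, scalar_reward) tuples. In a preference-conditioned
    setup the preference would also be fed to the model as an input, so one
    set of model weights sees many trade-offs (hypothetical interface).
    """
    signals = []
    for rewards in per_sample_rewards:
        preference = sample_preference(len(rewards), rng)
        signals.append((preference, scalarize(rewards, preference)))
    return signals


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical rewards: (prompt adherence, source fidelity) per sample.
    rewards = [[0.8, 0.3], [0.4, 0.9]]

    # Early scalarization: one trade-off, fixed before training starts.
    fixed = [scalarize(r, [0.5, 0.5]) for r in rewards]
    print("fixed weights:", fixed)

    # Preference-conditioned: a different trade-off for every sample.
    for preference, reward in build_training_signals(rewards, rng):
        print("preference:", preference, "reward:", reward)
```

At inference time, the same idea would let a user supply any preference vector on the simplex, for example `[1.0, 0.0]` to favor the first objective entirely, without retraining.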