The widespread applicability and increasing ubiquity of LLMs have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks, and keeping up with shifting user preferences is difficult in this setting. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of the reward, often because the two aspects are conflated in a single formulation, and thus fail to produce well-aligned responses. To remedy this, we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is realized as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
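The decoupled explore/exploit loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `mutate`, `generate`, and `reward` are hypothetical stand-ins for instruction mutation, LLM decoding, and a reward model, respectively.

```python
import random

def mutate(instruction, rng):
    # Exploration: perturb the instruction (a real system might paraphrase it).
    suffixes = [" (be concise)", " (explain step by step)", " (use examples)"]
    return instruction + rng.choice(suffixes)

def generate(instruction, rng):
    # Stand-in for decoding a response from a (possibly mutated) instruction.
    return f"response to: {instruction} [{rng.randint(0, 9)}]"

def reward(response):
    # Toy reward; a real reward model would score preference alignment.
    return len(response)

def evolve(instruction, generations=5, population=4, seed=0):
    rng = random.Random(seed)
    # Initialize the pool by decoding from mutated instructions (exploration).
    pool = [generate(mutate(instruction, rng), rng) for _ in range(population)]
    for _ in range(generations):
        # Exploitation: keep the well-rewarded half of the pool...
        survivors = sorted(pool, key=reward, reverse=True)[: population // 2]
        # ...and replace poorly-rewarded generations with fresh exploratory decodes.
        pool = survivors + [
            generate(mutate(instruction, rng), rng)
            for _ in range(population - len(survivors))
        ]
    return max(pool, key=reward)

best = evolve("Summarize the paper")
```

The key design point mirrored here is that exploration (mutated-instruction decoding) and exploitation (periodic replacement of low-reward generations) are separate, alternating steps rather than one conflated objective.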