The widespread applicability and increasing omnipresence of LLMs have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks, and keeping up with shifting user preferences is difficult in this setting. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of the reward -- often because these two aspects are conflated in their formulation -- and thus fail to produce well-aligned responses. To remedy this, we decouple the two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is realized as the periodic replacement of poorly rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
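The exploration/exploitation loop described above can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: `mutate_instruction`, `decode`, and `reward` are hypothetical stand-ins for instruction mutation, LLM sampling, and a reward model, respectively.

```python
import random

# Hypothetical stand-ins for illustration: in the real method these would be
# an instruction mutator, an LLM decoder, and a learned reward model.
def mutate_instruction(instr, rng):
    """Exploration: perturb the instruction (e.g. paraphrase); here, append a hint."""
    hints = ["be concise", "be detailed", "use examples", "be formal"]
    return f"{instr} ({rng.choice(hints)})"

def decode(instr, rng):
    """Stand-in for sampling a response from the LLM given an instruction."""
    return f"response[{instr}|seed={rng.randrange(1000)}]"

def reward(response):
    """Stand-in reward model: in this toy, longer responses score higher."""
    return len(response)

def evolve(instruction, pop_size=4, generations=3, seed=0):
    rng = random.Random(seed)
    # Initialize the population by decoding from mutated instructions (exploration).
    pop = []
    for _ in range(pop_size):
        resp = decode(mutate_instruction(instruction, rng), rng)
        pop.append((reward(resp), resp))
    for _ in range(generations):
        # Exploration: add new candidates decoded from freshly mutated instructions.
        for _ in range(pop_size):
            resp = decode(mutate_instruction(instruction, rng), rng)
            pop.append((reward(resp), resp))
        # Exploitation: periodically replace poorly rewarded generations by
        # keeping only the best-rewarded candidates.
        pop.sort(key=lambda x: x[0], reverse=True)
        pop = pop[:pop_size]
    return pop[0][1]  # best-rewarded response found

best = evolve("Explain photosynthesis")
```

Because selection only ever keeps the top-rewarded candidates, the best reward in the population is non-decreasing across generations, while fresh mutations keep injecting diversity.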