Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To lower costs further and achieve alignment without any expensive tuning or annotations, we introduce Dynamic Rewarding with Prompt Optimization (DRPO), a new tuning-free approach for self-alignment. Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft optimal alignment instructions, all without additional training or human intervention. The core of DRPO is a dynamic rewarding mechanism, which identifies and rectifies model-specific alignment weaknesses, allowing LLMs to adapt efficiently to diverse alignment challenges. Empirical evaluations on eight recent LLMs, both open- and closed-source, demonstrate that DRPO significantly enhances alignment performance, with DRPO-aligned base models outperforming their SFT/RLHF-tuned counterparts. Moreover, the prompts automatically optimized by DRPO surpass those curated by human experts, further validating the effectiveness of our approach. Our findings highlight the great potential of current LLMs to achieve adaptive self-alignment through inference-time optimization, complementing tuning-based alignment methods.
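To make the two components named above concrete, the following is a minimal, hypothetical Python sketch of a DRPO-style loop: a beam search over candidate system prompts, guided by a judge LLM that dynamically selects query-relevant reward criteria and critiques weak responses. The `llm()` helper, the prompt templates, and the two-line scoring format are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DRPO-style loop: beam search over system prompts,
# scored by an LLM judge that picks criteria relevant to each query.
# llm() is a stub standing in for any chat/completion API.

def llm(prompt: str) -> str:
    """Stub for a text-completion call; wire this to your model provider."""
    raise NotImplementedError

def dynamic_reward(query: str, response: str) -> tuple[float, str]:
    """Judge step: select the alignment criteria relevant to *this* query
    (e.g., safety for risky requests, factuality for knowledge requests),
    score the response on them, and return (average score, critique).
    The two-line reply format is an assumption made for this sketch."""
    judgment = llm(
        "Pick the 3 alignment criteria most relevant to the query, score the "
        "response 1-10 on each, then give a one-line critique.\n"
        f"Query: {query}\nResponse: {response}\n"
        "Reply as two lines: <average score>\\n<critique>"
    )
    score_line, critique = judgment.split("\n", 1)
    return float(score_line), critique.strip()

def optimize_system_prompt(seed: str, queries: list[str],
                           beam_width: int = 2, depth: int = 3) -> str:
    """Search step: score each candidate prompt on the query set, then ask
    the model to rewrite it against the collected critiques; keep the top
    candidates and return the best prompt seen overall."""
    beam = [seed]
    best_score, best_prompt = float("-inf"), seed
    for _ in range(depth):
        scored = []
        for prompt in beam:
            scores, critiques = [], []
            for q in queries:
                resp = llm(f"{prompt}\n\nUser: {q}\nAssistant:")
                s, c = dynamic_reward(q, resp)
                scores.append(s)
                critiques.append(c)
            avg = sum(scores) / len(scores)
            scored.append((avg, prompt, critiques))
            if avg > best_score:
                best_score, best_prompt = avg, prompt
        # Keep the top-scoring prompts and expand each with one
        # critique-guided revision (scored in the next iteration).
        scored.sort(key=lambda t: t[0], reverse=True)
        survivors = scored[:beam_width]
        beam = [p for _, p, _ in survivors]
        for _, p, crits in survivors:
            beam.append(llm(
                "Rewrite this system prompt to fix the weaknesses below, "
                f"keeping it concise.\nPrompt: {p}\nWeaknesses:\n"
                + "\n".join(crits)
            ))
    return best_prompt
```

The key point of the sketch is that the reward is not a fixed rubric: the judge chooses which criteria to score per query, which is what lets the search surface and repair model-specific weaknesses rather than optimizing a single static objective.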