Aligning large language models (LLMs) with human preferences is critical for their deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that requires no fine-tuning of model parameters. However, generating text that achieves both high reward and high likelihood remains a significant challenge. Existing methods often fail to generate high-reward text or incur substantial computational costs. In this paper, we propose Cascade Reward Sampling (CARDS) to address both issues, guaranteeing the generation of high-reward and high-likelihood text at significantly lower cost. Based on our analysis of reward models (RMs) on incomplete text and our observation that high-reward prefixes induce high-reward complete text, we use rejection sampling to iteratively generate small semantic segments that form such prefixes. The segment length is determined dynamically by the predictive uncertainty of the LLM. This strategy guarantees desirable prefixes for subsequent generation and significantly reduces both wasteful token re-generation and the number of reward-model scoring calls. Our experiments demonstrate substantial gains in both generation efficiency and alignment ratings over the baselines, achieving five times faster text generation and a 99\% win-tie rate in GPT-4/Claude-3 helpfulness evaluations.
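To make the sampling loop concrete, the following is a minimal sketch of segment-level rejection sampling with uncertainty-based segmentation, assuming a HuggingFace-style causal LM and a reward model that scores partial token sequences. The function name `cards_generate`, the threshold values, and the entropy-based boundary rule are illustrative stand-ins rather than the paper's exact implementation.

```python
import torch


@torch.no_grad()
def cards_generate(llm, tokenizer, reward_model, prompt,
                   reward_threshold=0.0, uncertainty_threshold=3.0,
                   max_new_tokens=128, max_segment_tries=20):
    """Sketch of segment-level rejection sampling (hypothetical interface).

    Assumes `llm` returns HuggingFace-style outputs with `.logits` and
    `reward_model` maps a token-id tensor to a scalar reward; thresholds
    are illustrative placeholders, not tuned values.
    """
    prefix_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = prefix_ids.shape[1]

    while prefix_ids.shape[1] - prompt_len < max_new_tokens:
        accepted = False
        for _ in range(max_segment_tries):
            candidate = prefix_ids
            seg_len = 0
            # Propose one semantic segment: sample tokens until the LLM's
            # predictive entropy spikes, which we treat as a segment boundary.
            while candidate.shape[1] - prompt_len < max_new_tokens:
                logits = llm(candidate).logits[:, -1, :]
                probs = torch.softmax(logits, dim=-1)
                entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
                if seg_len > 0 and entropy > uncertainty_threshold:
                    break
                next_tok = torch.multinomial(probs, num_samples=1)
                candidate = torch.cat([candidate, next_tok], dim=1)
                seg_len += 1
            # Rejection step: keep the segment only if the reward model already
            # scores the partial text above the threshold; otherwise resample.
            if reward_model(candidate) >= reward_threshold:
                prefix_ids = candidate
                accepted = True
                break
        if not accepted:
            prefix_ids = candidate  # fall back to the last proposal

    return tokenizer.decode(prefix_ids[0], skip_special_tokens=True)
```

Because each accepted segment already clears the reward threshold, later segments are conditioned on high-reward prefixes, which is what avoids re-generating long completions from scratch.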