Language models trained on large amounts of data are known to produce inappropriate content in some cases and require careful tuning to be deployed in the real world. We revisit the reward-augmented decoding (RAD) approach, which controls generation from a language model using the scores of a task-specific reward model. We investigate the training objective of RAD and reformulate it as the task of learning a reward matrix. We show that RAD is designed to support high flexibility when representing the reward matrices, which leads to higher computational costs during decoding. However, we demonstrate that RAD does not fully use this flexibility. Motivated by this, we propose a simpler but more efficient low-rank parametrization of the reward model, enabling fast and effective guided decoding. For the detoxification and sentiment control tasks, we show that our low-rank reward model performs on par with the more flexible RAD parametrization while requiring only a single reward model call per generated token.
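To make the decoding step concrete, below is a minimal sketch of reward-guided next-token selection with a low-rank reward head: the base language model's next-token logits are shifted by per-token reward estimates, where a rank-r head maps a single reward-model hidden state to rewards for the whole vocabulary, so one reward model call is made per generated token. All names (`toy_lm_logits`, `toy_reward_hidden`, `LowRankRewardHead`, `beta`) and the toy stand-in models are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: reward-guided decoding with a low-rank reward head.
# The LM and reward model are replaced by toy stand-ins for self-containment.
import torch

VOCAB, HIDDEN, RANK = 1000, 64, 8
beta = 1.0  # strength of reward guidance (assumed hyperparameter)

def toy_lm_logits(prefix_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the base language model: next-token logits for the prefix."""
    torch.manual_seed(int(prefix_ids.sum()))
    return torch.randn(VOCAB)

def toy_reward_hidden(prefix_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the reward model's hidden state at the current prefix."""
    torch.manual_seed(int(prefix_ids.sum()) + 1)
    return torch.randn(HIDDEN)

class LowRankRewardHead(torch.nn.Module):
    """Maps one reward-model hidden state to per-token rewards for the full
    vocabulary through a rank-r factorization, so a single reward-model call
    per step scores every candidate next token."""
    def __init__(self, hidden: int, vocab: int, rank: int):
        super().__init__()
        self.down = torch.nn.Linear(hidden, rank, bias=False)  # hidden -> r
        self.up = torch.nn.Linear(rank, vocab, bias=False)     # r -> vocab

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(h))  # reward estimate for every vocab token

@torch.no_grad()
def guided_next_token(prefix_ids: torch.Tensor, head: LowRankRewardHead) -> int:
    lm_logits = toy_lm_logits(prefix_ids)   # base LM distribution
    h = toy_reward_hidden(prefix_ids)       # single reward-model call per token
    rewards = head(h)                       # low-rank per-token rewards
    guided = lm_logits + beta * rewards     # reward-augmented logits
    return int(torch.distributions.Categorical(logits=guided).sample())

head = LowRankRewardHead(HIDDEN, VOCAB, RANK)
prefix = torch.tensor([1, 5, 42])
print(guided_next_token(prefix, head))
```

Under these assumptions, the low-rank head is what keeps the cost at a single reward model call per generated token, in contrast to the more flexible RAD parametrization, whose higher decoding cost the abstract attributes to its fully flexible reward-matrix representation.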