The relationship between the quality of a string and its probability $p(\boldsymbol{y})$ under a language model has been influential in the development of techniques for building good text generation systems. For example, several decoding algorithms are motivated as ways of manipulating $p(\boldsymbol{y})$ to produce higher-quality text. In this work, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through Reinforcement Learning from Human Feedback (RLHF). We find that, given a general language model and its aligned version, for corpora sampled from the aligned language model, there exists a trade-off between the strings' average reward and their average log-likelihood under the general language model. We provide a formal treatment of this issue and demonstrate how the choice of sampling adaptor allows us to select how much likelihood we exchange for reward.
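The sketch below is a minimal, self-contained illustration of the kind of measurement the abstract describes, not the paper's setup or code. It assumes a toy categorical "language model" over a small set of strings, takes the aligned model to be the standard KL-regularized exponential tilt of the base model by a reward (an assumption, not a claim about the paper's models), and uses temperature and top-$k$ truncation as hypothetical stand-ins for sampling adaptors. For each adaptor it estimates the corpus-level average reward and average log-likelihood under the base model; the exact direction and size of the trade-off will depend on the distributions and rewards used.

```python
# Toy sketch: measuring average reward vs. average base-model log-likelihood
# for corpora drawn from an aligned distribution under different sampling adaptors.
# All names (p_base, p_aligned, adapt, corpus_stats) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
V = 50                                   # number of toy "strings"
p_base = rng.dirichlet(np.ones(V))       # base (general) language model
r = rng.normal(size=V)                   # per-string rewards
beta = 1.0                               # KL-regularization strength (assumed)

# Assumed aligned model: exponential tilt of the base model by the reward.
p_aligned = p_base * np.exp(r / beta)
p_aligned /= p_aligned.sum()

def adapt(p, temperature=1.0, top_k=None):
    """Apply a simple sampling adaptor (temperature and/or top-k) to a categorical distribution."""
    q = p ** (1.0 / temperature)
    if top_k is not None:
        cutoff = np.sort(q)[-top_k]
        q = np.where(q >= cutoff, q, 0.0)
    return q / q.sum()

def corpus_stats(q, n=20000):
    """Average reward and average log p_base over a corpus sampled from q."""
    idx = rng.choice(V, size=n, p=q)
    return r[idx].mean(), np.log(p_base[idx]).mean()

for name, q in [("ancestral", adapt(p_aligned)),
                ("temp=0.7", adapt(p_aligned, temperature=0.7)),
                ("top-k=5", adapt(p_aligned, top_k=5))]:
    avg_r, avg_ll = corpus_stats(q)
    print(f"{name:10s}  avg reward={avg_r:+.3f}  avg log p_base={avg_ll:+.3f}")
```

Comparing the printed rows across adaptors shows how reshaping the aligned distribution before sampling moves the corpus along a reward--likelihood frontier, which is the trade-off the paper formalizes.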