We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes, we consider either reward modeling or a class of novel direct policy learning objectives based on importance-weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as in domains with implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe.
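To make the reweighting procedure concrete, here is a minimal sketch of Q-probe inference. It assumes an API-style model object exposing `sample(prompt)` and `embed(prompt, completion)` (hypothetical placeholder names, not the repository's actual interface), and placeholder values for the number of candidates `k` and the softmax temperature `beta`; it is an illustration of the sampling-and-reweighting idea, not the paper's exact implementation.

```python
import numpy as np

def q_probe_select(prompt, model, theta, k=48, beta=0.1, rng=None):
    """Draw k candidate completions, score each with a linear probe on the
    model's embeddings, and sample one completion from a softmax over the
    probe scores. A smaller beta concentrates mass on the highest-scoring
    candidate, approaching the KL-constrained maximization described above."""
    rng = np.random.default_rng() if rng is None else rng

    # Candidate completions from the base model (only sampling access needed).
    completions = [model.sample(prompt) for _ in range(k)]

    # Embeddings of each (prompt, completion) pair, shape (k, d).
    feats = np.stack([model.embed(prompt, c) for c in completions])

    # Linear Q-probe values, shape (k,).
    scores = feats @ theta

    # Softmax reweighting of the candidates by their probe scores.
    probs = np.exp((scores - scores.max()) / beta)
    probs /= probs.sum()

    return completions[rng.choice(k, p=probs)]
```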