Many capable large language models (LLMs) are developed via self-supervised pre-training followed by a reinforcement-learning fine-tuning phase, often based on human or AI feedback. During this stage, models may be guided by their inductive biases to rely on simpler features that are easier to extract, at the cost of robustness and generalisation. We investigate whether principles governing inductive biases in the supervised fine-tuning of LLMs also apply when the fine-tuning process uses reinforcement learning. Following Lovering et al. (2021), we test two hypotheses: that features more $\textit{extractable}$ after pre-training are more likely to be utilised by the final policy, and that the evidence for or against a feature predicts whether it will be utilised. Through controlled experiments on synthetic and natural language tasks, we find statistically significant correlations that constitute strong evidence for these hypotheses.