Tell, don't show: Declarative facts influence how LLMs generalize

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by $1^{\circ} \mathrm{C}$ by 2050". To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement $S$ increases the model likelihood for logical consequences of $S$. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.
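As a rough illustration of the quantity at stake, the claim that finetuning on a statement $S$ "increases the model likelihood for logical consequences of $S$" can be read as a comparison of the likelihood a model assigns to the same completion before and after finetuning. The sketch below is a minimal example under that reading, not the evaluation procedure used in this work: the function names are hypothetical, and the per-token log-probabilities are made-up placeholder values rather than measurements.

```python
import math


def sequence_log_likelihood(token_logprobs):
    """Log-likelihood of a completion: the sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def likelihood_ratio(logprobs_before, logprobs_after):
    """Ratio P_after(completion) / P_before(completion) for one fixed completion.

    A ratio above 1 means the completion became more likely after finetuning.
    """
    delta = (sequence_log_likelihood(logprobs_after)
             - sequence_log_likelihood(logprobs_before))
    return math.exp(delta)


# Placeholder per-token log-probabilities (NOT real measurements) for a single
# completion that would follow logically from a declarative statement S.
before = [-2.0, -1.5, -3.0]  # hypothetical model before finetuning on S
after = [-1.9, -1.5, -2.8]   # hypothetical model after finetuning on S

ratio = likelihood_ratio(before, after)
print(f"likelihood ratio (after / before): {ratio:.3f}")
```

Working in log space avoids numerical underflow for long completions. A ratio only slightly above 1 would correspond to the kind of effect described above as small in absolute terms.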
